Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse
Author(s) - Lan Li, Nikolaus Nova Parulian, Bertram Ludäscher
Publication year - 2022
Publication title - International Journal of Digital Curation
Language(s) - English
Resource type - Journals
ISSN - 1746-8256
DOI - 10.2218/ijdc.v16i1.771
Subject(s) - workflow, computer science, dataflow, reuse, transparency (behavior), reusability, recipe, software engineering, database, programming language, software, engineering, chemistry, computer security, food science, waste management
Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make it challenging to apply the FAIR principles to data cleaning recipes. We then demonstrate how the transparency and reusability of recipes can be improved by analyzing dataflow dependencies within recipes. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which can then be reused individually as data cleaning modules. We have implemented a prototype of this approach as part of an ongoing project to develop open-source companion tools for OpenRefine.
Keywords: Data Cleaning, Provenance, Workflow Analysis
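The idea of detecting independent subworkflows from column-level dependencies can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use OpenRefine's recipe format: the step dictionaries with `name`, `reads`, and `writes` fields, the `detect_modules` function, and the sample recipe are all hypothetical. The sketch assumes that two steps belong to the same module whenever they touch a common column, directly or transitively, and groups them with a union-find structure.

```python
from collections import defaultdict


def detect_modules(steps):
    """Group recipe steps into independent modules by shared columns.

    Each step is a dict with hypothetical keys "name", "reads", "writes".
    Steps that touch a common column (directly or transitively) end up
    in the same module; steps are kept in recipe order within a module.
    """
    parent = {}

    def find(col):
        # Return the representative column of col's group (path halving).
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]
            col = parent[col]
        return col

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    def columns(step):
        return sorted(set(step["reads"]) | set(step["writes"]))

    # Pass 1: all columns touched by one step belong to one group.
    for step in steps:
        cols = columns(step)
        for col in cols:
            find(col)
        for col in cols[1:]:
            union(cols[0], col)

    # Pass 2: assign each step to the group of its columns.
    modules = defaultdict(list)
    for index, step in enumerate(steps):
        cols = columns(step)
        # A step that touches no column forms a module of its own.
        key = find(cols[0]) if cols else ("step", index)
        modules[key].append(step["name"])
    return list(modules.values())


# Hypothetical recipe: two steps work on dates, three on names.
recipe = [
    {"name": "trim name", "reads": ["name"], "writes": ["name"]},
    {"name": "parse date", "reads": ["date"], "writes": ["date"]},
    {"name": "split name", "reads": ["name"], "writes": ["first", "last"]},
    {"name": "uppercase last", "reads": ["last"], "writes": ["last"]},
    {"name": "extract year", "reads": ["date"], "writes": ["year"]},
]

modules = detect_modules(recipe)
for module in modules:
    print(module)
```

In this example the name-related steps and the date-related steps come out as two separate modules, each of which could be reused on its own. The sketch ignores operations that affect whole rows (for example, removing rows), which would create dependencies across all columns and would need separate handling.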
