Descriptive and prescriptive data cleaning

Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

40 Scopus citations

Abstract

Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
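The decoupling described above can be illustrated with a minimal sketch. It is not the system proposed in the paper: the tables, the rule, and the helper names (`build_report`, `violations`, `source_candidates`) are hypothetical, and the "explanation" step is a naive lineage lookup rather than the paper's techniques. It only shows a quality rule being checked on a report (target level) and the violation being traced back to contributing source tuples (source level).

```python
from collections import defaultdict

# Hypothetical source tables, invented for illustration only.
shops = [
    {"shop_id": 1, "region": "east"},
    {"shop_id": 2, "region": "east"},
    {"shop_id": 3, "region": "west"},
]
sales = [
    {"sale_id": 10, "shop_id": 1, "amount": 120},
    {"sale_id": 11, "shop_id": 2, "amount": -500},  # injected source error
    {"sale_id": 12, "shop_id": 2, "amount": 80},
    {"sale_id": 13, "shop_id": 3, "amount": 200},
]


def build_report(shops, sales):
    """Transformation: join sales to shops and sum amounts per region.

    Also records lineage, i.e. which sale_ids contributed to each report row.
    """
    region_of = {s["shop_id"]: s["region"] for s in shops}
    totals = defaultdict(int)
    lineage = defaultdict(list)
    for sale in sales:
        region = region_of[sale["shop_id"]]
        totals[region] += sale["amount"]
        lineage[region].append(sale["sale_id"])
    return dict(totals), dict(lineage)


def violations(totals):
    """Quality rule on the target (assumed): a regional total must not be negative."""
    return sorted(region for region, total in totals.items() if total < 0)


def source_candidates(bad_regions, lineage, sales):
    """Source-level view: contributing sales with a negative amount."""
    by_id = {s["sale_id"]: s for s in sales}
    return {
        region: [sid for sid in lineage[region] if by_id[sid]["amount"] < 0]
        for region in bad_regions
    }


totals, lineage = build_report(shops, sales)
bad = violations(totals)                          # violations seen on the report
suspects = source_candidates(bad, lineage, sales)  # source tuples worth inspecting
print(totals, bad, suspects)
```

In this toy setup, repairing only the report would have to be redone every time `sales` changes, whereas the lineage lookup points at the source tuple to inspect.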

Original language: English (US)
Title of host publication: SIGMOD 2014 - Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 445-456
Number of pages: 12
ISBN (Print): 9781450323765
State: Published - 2014
Event: 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014 - Snowbird, UT, United States
Duration: Jun 22 2014 to Jun 27 2014

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014
Country/Territory: United States
City: Snowbird, UT
Period: 6/22/14 to 6/27/14

ASJC Scopus subject areas

  • Software
  • Information Systems
