BigDansing: A system for big data cleansing

Zuhair Khayyaty, Ihab F. Ilyasz, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge Arnulfo Quiané-Ruiz, Nan Tang, Si Yin

Research output: Chapter in Book/Report/Conference proceedingConference contribution

93 Scopus citations

Abstract

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that Big-Dansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.

Original languageEnglish (US)
Title of host publicationSIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages1215-1230
Number of pages16
ISBN (Electronic)9781450327589
DOIs
StatePublished - May 27 2015
EventACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: May 31 2015Jun 4 2015

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
Volume2015-May
ISSN (Print)0730-8078

Other

OtherACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country/TerritoryAustralia
CityMelbourne
Period5/31/156/4/15

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint

Dive into the research topics of 'BigDansing: A system for big data cleansing'. Together they form a unique fingerprint.

Cite this