BigDansing: A system for big data cleansing

Zuhair Khayyat; Ihab F. Ilyas; Alekh Jindal; Samuel Madden; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Si Yin

doi:10.1145/2723372.2747646

Computing and Augmented Intelligence, School of (IAFSE-SCAI)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

93 Scopus citations

Abstract

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
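
To make the cost of "enumerating pairs of tuples" concrete, here is a minimal single-machine sketch of checking one data quality rule, a functional dependency (zipcode determines city). It is an illustration only and does not use BigDansing's interface: the table, the column names, and the function `fd_violations` are all hypothetical. Grouping tuples by the left-hand side of the rule before pairing them is one common way to avoid comparing every tuple with every other tuple.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy table; the column names are illustrative only.
rows = [
    {"id": 1, "zipcode": "10001", "city": "New York"},
    {"id": 2, "zipcode": "10001", "city": "NYC"},
    {"id": 3, "zipcode": "60601", "city": "Chicago"},
    {"id": 4, "zipcode": "60601", "city": "Chicago"},
    {"id": 5, "zipcode": "10001", "city": "New York"},
]


def fd_violations(rows, lhs, rhs):
    """Return id pairs that agree on `lhs` but differ on `rhs`."""
    # Group tuples by the left-hand side so that only tuples sharing
    # a key are paired, instead of enumerating all n*(n-1)/2 pairs.
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[lhs]].append(row)

    violations = []
    for block in blocks.values():
        # Pairwise comparison within a block only.
        for a, b in combinations(block, 2):
            if a[rhs] != b[rhs]:
                violations.append((a["id"], b["id"]))
    return violations


violations = fd_violations(rows, "zipcode", "city")
print(violations)
```

In this toy table only tuples 1, 2, and 5 share a zipcode with conflicting city values, so far fewer pairs are compared than the ten a naive all-pairs scan would produce. Rules with inequality conditions cannot be grouped by key this way, which is part of why the abstract singles out inequality joins as costly.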

Original language	English (US)
Title of host publication	SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	1215-1230
Number of pages	16
ISBN (Electronic)	9781450327589
DOIs	https://doi.org/10.1145/2723372.2747646
State	Published - May 27 2015
Event	ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia; Duration: May 31 2015 → Jun 4 2015

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume	2015-May
ISSN (Print)	0730-8078

Other

Other	ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country/Territory	Australia
City	Melbourne
Period	5/31/15 → 6/4/15

ASJC Scopus subject areas

Software
Information Systems

Access to Document

10.1145/2723372.2747646

Cite this

Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J. A., Tang, N., & Yin, S. (2015). BigDansing: A system for big data cleansing. In SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1215-1230). (Proceedings of the ACM SIGMOD International Conference on Management of Data; Vol. 2015-May). Association for Computing Machinery. https://doi.org/10.1145/2723372.2747646

BigDansing: A system for big data cleansing. / Khayyat, Zuhair; Ilyas, Ihab F.; Jindal, Alekh et al.
SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, 2015. p. 1215-1230 (Proceedings of the ACM SIGMOD International Conference on Management of Data; Vol. 2015-May).

Khayyat, Z, Ilyas, IF, Jindal, A, Madden, S, Ouzzani, M, Papotti, P, Quiané-Ruiz, JA, Tang, N & Yin, S 2015, BigDansing: A system for big data cleansing. in SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 2015-May, Association for Computing Machinery, pp. 1215-1230, ACM SIGMOD International Conference on Management of Data, SIGMOD 2015, Melbourne, Australia, 5/31/15. https://doi.org/10.1145/2723372.2747646

Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P et al. BigDansing: A system for big data cleansing. In SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery. 2015. p. 1215-1230. (Proceedings of the ACM SIGMOD International Conference on Management of Data). doi: 10.1145/2723372.2747646

@inproceedings{f7d28d0276864ead988b544958c31beb,

title = "BigDansing: A system for big data cleansing",

abstract = "Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.",

author = "Zuhair Khayyat and Ilyas, {Ihab F.} and Alekh Jindal and Samuel Madden and Mourad Ouzzani and Paolo Papotti and Quian{\'e}-Ruiz, {Jorge Arnulfo} and Nan Tang and Si Yin",

note = "Publisher Copyright: Copyright {\textcopyright} 2015 ACM.; ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 ; Conference date: 31-05-2015 Through 04-06-2015",

year = "2015",

month = may,

day = "27",

doi = "10.1145/2723372.2747646",

language = "English (US)",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "1215--1230",

booktitle = "SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - BigDansing: A system for big data cleansing

T2 - ACM SIGMOD International Conference on Management of Data, SIGMOD 2015

AU - Khayyat, Zuhair

AU - Ilyas, Ihab F.

AU - Jindal, Alekh

AU - Madden, Samuel

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Quiané-Ruiz, Jorge Arnulfo

AU - Tang, Nan

AU - Yin, Si

PY - 2015/5/27

Y1 - 2015/5/27

N2 - Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.

UR - http://www.scopus.com/inward/record.url?scp=84949872769&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84949872769&partnerID=8YFLogxK

U2 - 10.1145/2723372.2747646

DO - 10.1145/2723372.2747646

M3 - Conference contribution

AN - SCOPUS:84949872769

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 1215

EP - 1230

BT - SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

PB - Association for Computing Machinery

Y2 - 31 May 2015 through 4 June 2015

ER -
