TY - GEN
T1 - Holistic data cleaning
T2 - 29th International Conference on Data Engineering, ICDE 2013
AU - Chu, Xu
AU - Ilyas, Ihab F.
AU - Papotti, Paolo
PY - 2013/8/15
Y1 - 2013/8/15
N2 - Data cleaning is an important problem, and data quality rules are the most promising way to address it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work, we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as "greater than" and "less than". More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such a holistic view of the conflicts is the starting point for a novel definition of repair context, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.
AB - Data cleaning is an important problem, and data quality rules are the most promising way to address it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work, we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as "greater than" and "less than". More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such a holistic view of the conflicts is the starting point for a novel definition of repair context, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.
UR - http://www.scopus.com/inward/record.url?scp=84881365460&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84881365460&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2013.6544847
DO - 10.1109/ICDE.2013.6544847
M3 - Conference contribution
AN - SCOPUS:84881365460
SN - 9781467349086
T3 - Proceedings - International Conference on Data Engineering
SP - 458
EP - 469
BT - ICDE 2013 - 29th International Conference on Data Engineering
Y2 - 8 April 2013 through 11 April 2013
ER -