TY - GEN
T1 - Interactive and deterministic data cleaning
T2 - 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
AU - He, Jian
AU - Veltri, Enzo
AU - Santoro, Donatello
AU - Li, Guoliang
AU - Mecca, Giansalvatore
AU - Papotti, Paolo
AU - Tang, Nan
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/6/26
Y1 - 2016/6/26
AB - We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper consists in finding a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, Falcon navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides using traditional one-hop based traverse algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms such that Falcon can dive over the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. We have conducted extensive experiments using both real-world and synthetic datasets to show that Falcon can effectively communicate with users in data repairing.
KW - Data cleaning
KW - Declarative
KW - Deterministic
KW - Interactive
UR - http://www.scopus.com/inward/record.url?scp=84979711032&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84979711032&partnerID=8YFLogxK
U2 - 10.1145/2882903.2915242
DO - 10.1145/2882903.2915242
M3 - Conference contribution
AN - SCOPUS:84979711032
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 893
EP - 907
BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 26 June 2016 through 1 July 2016
ER -