Detecting data errors: Where are we and what needs to be done?

Ziawasch Abedjan; Xu Chu; Dong Deng; Raul Castro Fernandez; Ihab F. Ilyas; Mourad Ouzzani; Paolo Papotti; Michael Stonebraker; Nan Tang

doi:10.14778/2994509.2994518

Computing and Augmented Intelligence, School of (IAFSE-SCAI)

Research output: Contribution to journal › Conference article › peer-review

143 Scopus citations

Abstract

Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.
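The two quantities the abstract turns on, per-tool coverage and the order in which tools are invoked, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tool names, the toy cells, and the precision-first ordering heuristic are not taken from the paper, and the paper's actual strategy may differ.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, str]  # hypothetical cell id: (row number, column name)


def coverage(flagged: Set[Cell], truth: Set[Cell]) -> float:
    """Fraction of the true errors that a set of flagged cells captures."""
    return len(flagged & truth) / len(truth) if truth else 0.0


def precision(flagged: Set[Cell], truth: Set[Cell]) -> float:
    """Fraction of flagged cells that are true errors."""
    return len(flagged & truth) / len(flagged) if flagged else 0.0


def order_by_precision(tools: Dict[str, Set[Cell]], truth: Set[Cell]) -> List[str]:
    """Assumed heuristic: run the most precise tools first, so a human
    reviewer sees fewer false alarms early on."""
    return sorted(tools, key=lambda name: precision(tools[name], truth), reverse=True)


# Toy ground truth and tool outputs, invented for illustration only.
truth = {(1, "zip"), (2, "zip"), (3, "age"), (4, "name"), (8, "city")}
tools = {
    "outlier": {(3, "age"), (5, "age")},
    "rules": {(1, "zip"), (2, "zip")},
    "dedup": {(4, "name"), (6, "name"), (7, "name")},
}

per_tool = {name: coverage(cells, truth) for name, cells in tools.items()}
union_coverage = coverage(set().union(*tools.values()), truth)
ranking = order_by_precision(tools, truth)
```

In this toy setting no single tool covers more than 40% of the errors, and even the union of all three misses one error, which mirrors the abstract's observation that some errors are not detectable by any of the tools tested.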

Original language	English (US)
Pages (from-to)	993-1004
Number of pages	12
Journal	Proceedings of the VLDB Endowment
Volume	9
Issue number	12
DOIs	https://doi.org/10.14778/2994509.2994518
State	Published - 2016
Event	42nd International Conference on Very Large Data Bases, VLDB 2016 - New Delhi, India
Duration	Sep 5 2016 → Sep 9 2016

ASJC Scopus subject areas

Computer Science (miscellaneous)
General Computer Science

Cite this

@article{1585fa9bdbc94bcd913ba5c5454f482a,

title = "Detecting data errors: Where are we and what needs to be done?",

abstract = "Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.",

author = "Ziawasch Abedjan and Xu Chu and Dong Deng and Fernandez, {Raul Castro} and Ilyas, {Ihab F.} and Mourad Ouzzani and Paolo Papotti and Michael Stonebraker and Nan Tang",

note = "Publisher Copyright: {\textcopyright} 2016 VLDB Endowment 2150-8097/16/08.; 42nd International Conference on Very Large Data Bases, VLDB 2016 ; Conference date: 05-09-2016 Through 09-09-2016",

year = "2016",

doi = "10.14778/2994509.2994518",

language = "English (US)",

volume = "9",

pages = "993--1004",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "12",

}

TY - JOUR

T1 - Detecting data errors: Where are we and what needs to be done?

T2 - 42nd International Conference on Very Large Data Bases, VLDB 2016

AU - Abedjan, Ziawasch

AU - Chu, Xu

AU - Deng, Dong

AU - Fernandez, Raul Castro

AU - Ilyas, Ihab F.

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Stonebraker, Michael

AU - Tang, Nan

PY - 2016

Y1 - 2016

N2 - Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.

UR - http://www.scopus.com/inward/record.url?scp=85013662261&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013662261&partnerID=8YFLogxK

U2 - 10.14778/2994509.2994518

DO - 10.14778/2994509.2994518

M3 - Conference article

AN - SCOPUS:85013662261

SN - 2150-8097

VL - 9

SP - 993

EP - 1004

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 12

Y2 - 5 September 2016 through 9 September 2016

ER -
