BayesWipe: A scalable probabilistic framework for improving data quality

Sushovan De; Yuheng Hu; Venkata Vamsikrishna Meduri; Yi Chen; Subbarao Kambhampati

doi:10.1145/2992787

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.

Original language	English (US)
Article number	5
Journal	Journal of Data and Information Quality
Volume	8
Issue number	1
DOIs	https://doi.org/10.1145/2992787
State	Published - Oct 2016

Keywords

Data quality
Offline and online cleaning
Statistical data cleaning

ASJC Scopus subject areas

Information Systems
Information Systems and Management

Cite this

@article{e2efa7cd456d46f1ab5312a4786b1a4b,

title = "BayesWipe: A scalable probabilistic framework for improving data quality",

abstract = "Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.",

keywords = "Data quality, Offline and online cleaning, Statistical data cleaning",

author = "Sushovan De and Yuheng Hu and Meduri, {Venkata Vamsikrishna} and Yi Chen and Subbarao Kambhampati",

note = "Funding Information: This research was supported in part by the ONR grants N00014-13-1-0176, N0014-13-1-0519, ARO grant W911NF-13-1-0023, NSF CAREER award \#1322406, a Google Research Grant, and an endowment from the Leir Charitable Foundations. This work was done when the authors were with the Department of Computer Science \& Engineering at Arizona State University. Sushovan De is now at Google. Publisher Copyright: {\textcopyright} 2016 ACM.",

year = "2016",

month = oct,

doi = "10.1145/2992787",

language = "English (US)",

volume = "8",

journal = "Journal of Data and Information Quality",

issn = "1936-1955",

publisher = "Association for Computing Machinery (ACM)",

number = "1",

}

TY - JOUR

T1 - BayesWipe

T2 - A scalable probabilistic framework for improving data quality

AU - De, Sushovan

AU - Hu, Yuheng

AU - Meduri, Venkata Vamsikrishna

AU - Chen, Yi

AU - Kambhampati, Subbarao

N1 - Funding Information: This research was supported in part by the ONR grants N00014-13-1-0176, N0014-13-1-0519, ARO grant W911NF-13-1-0023, NSF CAREER award #1322406, a Google Research Grant, and an endowment from the Leir Charitable Foundations. This work was done when the authors were with the Department of Computer Science & Engineering at Arizona State University. Sushovan De is now at Google. Publisher Copyright: © 2016 ACM.

PY - 2016/10

Y1 - 2016/10

N2 - Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.

AB - Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.

KW - Data quality

KW - Offline and online cleaning

KW - Statistical data cleaning

UR - http://www.scopus.com/inward/record.url?scp=84994571337&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84994571337&partnerID=8YFLogxK

U2 - 10.1145/2992787

DO - 10.1145/2992787

M3 - Article

AN - SCOPUS:84994571337

SN - 1936-1955

VL - 8

JO - Journal of Data and Information Quality

JF - Journal of Data and Information Quality

IS - 1

M1 - 5

ER -
