BayesWipe: A multimodal system for data cleaning and consistent query answering on structured bigdata

Sushovan De, Yuheng Hu, Yi Chen, Subbarao Kambhampati

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Recent efforts in cleaning structured data have focused exclusively on problems such as data deduplication, record matching, and data standardization; none of these addresses fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum-cost repair of tuples that violate static constraints such as conditional functional dependencies (CFDs), which have to be provided by domain experts or learned from a clean sample of the database. In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned directly from the noisy database. We thus avoid the need for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database when write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
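
The idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It scores each candidate clean tuple by prior times error likelihood and keeps the best one. Everything here is an assumption made for illustration: the prior is the empirical tuple frequency in the noisy data (a stand-in for the Bayesian generative model), the error model is a hypothetical string-similarity decay, and the names attr_likelihood, posterior, clean_tuple and match_probability are invented.

```python
from collections import Counter
from difflib import SequenceMatcher
import math


def attr_likelihood(observed: str, clean: str, sharpness: float = 8.0) -> float:
    """Hypothetical error model: P(observed | clean) decays as the strings diverge."""
    similarity = SequenceMatcher(None, observed, clean).ratio()
    return math.exp(-sharpness * (1.0 - similarity))


def posterior(observed, rows):
    """Normalized P(clean | observed) over the distinct tuples seen in the noisy data."""
    counts = Counter(rows)
    total = sum(counts.values())
    scores = {}
    for candidate, n in counts.items():
        prior = n / total  # assumed stand-in for the generative model
        likelihood = math.prod(
            attr_likelihood(o, c) for o, c in zip(observed, candidate)
        )
        scores[candidate] = prior * likelihood
    z = sum(scores.values())
    return {candidate: s / z for candidate, s in scores.items()}


def clean_tuple(observed, rows):
    """Offline cleaning: return the most probable clean version of a tuple."""
    post = posterior(observed, rows)
    return max(post, key=post.get)


def match_probability(observed, rows, attr_index, value):
    """Query time, no writes: probability that the clean tuple has attr == value."""
    post = posterior(observed, rows)
    return sum(p for cand, p in post.items() if cand[attr_index] == value)


# Toy noisy table (invented data): one row has a typo in the first attribute.
rows = [("Honda", "Civic")] * 5 + [("Honda", "Accord")] * 4 + [("Hnda", "Civic")]
dirty = ("Hnda", "Civic")

print(clean_tuple(dirty, rows))
print(match_probability(dirty, rows, 0, "Honda"))
```

The second function hints at how a dirty tuple could still be returned for a query such as make = "Honda" without modifying the stored data: the tuple is ranked by the probability that its clean version satisfies the predicate.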

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 15-24
Number of pages: 10
ISBN (Print): 9781479956654
DOI: https://doi.org/10.1109/BigData.2014.7004207
State: Published - Jan 7 2015
Event: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington, United States
Duration: Oct 27 2014 to Oct 30 2014

Keywords

  • Data cleaning
  • Databases
  • Query rewriting
  • Uncertainty
  • Web databases

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems

Cite this

De, S., Hu, Y., Chen, Y., & Kambhampati, S. (2015). BayesWipe: A multimodal system for data cleaning and consistent query answering on structured bigdata. In Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014 (pp. 15-24). [7004207] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2014.7004207

@inproceedings{a3bb99723f6d4de2bec94c2f7791022c,
title = "BayesWipe: A multimodal system for data cleaning and consistent query answering on structured bigdata",
abstract = "Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.",
keywords = "Data cleaning, Databases, Query rewriting, Uncertainty, Web databases",
author = "Sushovan De and Yuheng Hu and Yi Chen and Subbarao Kambhampati",
year = "2015",
month = "1",
day = "7",
doi = "10.1109/BigData.2014.7004207",
language = "English (US)",
isbn = "9781479956654",
pages = "15--24",
booktitle = "Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - BayesWipe

T2 - A multimodal system for data cleaning and consistent query answering on structured bigdata

AU - De, Sushovan

AU - Hu, Yuheng

AU - Chen, Yi

AU - Kambhampati, Subbarao

PY - 2015/1/7

Y1 - 2015/1/7

AB - Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.

KW - Data cleaning

KW - Databases

KW - Query rewriting

KW - Uncertainty

KW - Web databases

UR - http://www.scopus.com/inward/record.url?scp=84921752480&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921752480&partnerID=8YFLogxK

U2 - 10.1109/BigData.2014.7004207

DO - 10.1109/BigData.2014.7004207

M3 - Conference contribution

AN - SCOPUS:84921752480

SN - 9781479956654

SP - 15

EP - 24

BT - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

PB - Institute of Electrical and Electronics Engineers Inc.

ER -