Messing up with BART

Error generation for evaluating data-cleaning algorithms

Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, Donatello Santoro

Research output: Contribution to journal › Article

18 Citations (Scopus)

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
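To make the idea of controlled error generation concrete, here is a minimal Python sketch. It is not the BART algorithm or its API: the function names (`violates_fd`, `inject_errors`), the single functional-dependency constraint, and the `_ERR` suffix are all assumptions chosen for illustration. It greedily changes one cell per group of tuples that share the same left-hand-side value, so that each change is guaranteed to create a violation of the dependency, and it skips groups where a change would go undetected (which is where a greedy approach gives up completeness).

```python
import random


def violates_fd(rows, lhs, rhs):
    """Return True if two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return True
        seen.setdefault(key, row[rhs])
    return False


def inject_errors(rows, lhs, rhs, n_errors, seed=0):
    """Hypothetical greedy injector for one dependency lhs -> rhs.

    Assumes `rows` is clean with respect to the dependency. Returns a dirty
    copy of the rows and a log of (row_index, attribute, old, new) changes.
    """
    rng = random.Random(seed)
    dirty = [dict(r) for r in rows]  # never modify the clean input

    # Group row indexes by their lhs value.
    groups = {}
    for i, row in enumerate(dirty):
        groups.setdefault(row[lhs], []).append(i)

    # A change is detectable only if another row shares the same lhs value.
    candidates = [g for g in groups.values() if len(g) >= 2]
    rng.shuffle(candidates)

    changes = []
    # One change per group, so the injected errors do not interact.
    for group in candidates[:n_errors]:
        i = group[0]
        old = dirty[i][rhs]
        new = f"{old}_ERR"  # assumed to differ from every clean value
        dirty[i][rhs] = new
        changes.append((i, rhs, old, new))
    return dirty, changes


clean = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "94105", "city": "San Francisco"},
]
dirty, changes = inject_errors(clean, "zip", "city", n_errors=2)
```

The change log plays the role of ground truth: a cleaning algorithm run on `dirty` can be scored against it. The sketch may return fewer changes than requested when too few groups have at least two rows.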

Original language: English (US)
Pages (from-to): 36-47
Number of pages: 12
Journal: Unknown Journal
Volume: 9
Issue number: 2
State: Published - 2016
Externally published: Yes

Fingerprint

Cleaning
Databases
Benchmarking
Scalability
Data Accuracy

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Arocena, P. C., Glavic, B., Mecca, G., Miller, R. J., Papotti, P., & Santoro, D. (2016). Messing up with BART: Error generation for evaluating data-cleaning algorithms. Unknown Journal, 9(2), 36-47.

Messing up with BART : Error generation for evaluating data-cleaning algorithms. / Arocena, Patricia C.; Glavic, Boris; Mecca, Giansalvatore; Miller, Renée J.; Papotti, Paolo; Santoro, Donatello.

In: Unknown Journal, Vol. 9, No. 2, 2016, p. 36-47.

Arocena, PC, Glavic, B, Mecca, G, Miller, RJ, Papotti, P & Santoro, D 2016, 'Messing up with BART: Error generation for evaluating data-cleaning algorithms', Unknown Journal, vol. 9, no. 2, pp. 36-47.
Arocena PC, Glavic B, Mecca G, Miller RJ, Papotti P, Santoro D. Messing up with BART: Error generation for evaluating data-cleaning algorithms. Unknown Journal. 2016;9(2):36-47.
Arocena, Patricia C. ; Glavic, Boris ; Mecca, Giansalvatore ; Miller, Renée J. ; Papotti, Paolo ; Santoro, Donatello. / Messing up with BART : Error generation for evaluating data-cleaning algorithms. In: Unknown Journal. 2016 ; Vol. 9, No. 2. pp. 36-47.
@article{dfeef2e9358e4dd19f5dc52244546d3b,
title = "Messing up with BART: Error generation for evaluating data-cleaning algorithms",
abstract = "We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.",
author = "Arocena, {Patricia C.} and Boris Glavic and Giansalvatore Mecca and Miller, {Ren{\'e}e J.} and Paolo Papotti and Donatello Santoro",
year = "2016",
language = "English (US)",
volume = "9",
pages = "36--47",
journal = "Scanning Electron Microscopy",
issn = "0586-5581",
publisher = "Scanning Microscopy International",
number = "2",

}

TY - JOUR

T1 - Messing up with BART

T2 - Error generation for evaluating data-cleaning algorithms

AU - Arocena, Patricia C.

AU - Glavic, Boris

AU - Mecca, Giansalvatore

AU - Miller, Renée J.

AU - Papotti, Paolo

AU - Santoro, Donatello

PY - 2016

Y1 - 2016

N2 - We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

AB - We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

UR - http://www.scopus.com/inward/record.url?scp=84975824359&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84975824359&partnerID=8YFLogxK

M3 - Article

VL - 9

SP - 36

EP - 47

JO - Scanning Electron Microscopy

JF - Scanning Electron Microscopy

SN - 0586-5581

IS - 2

ER -