Messing up with BART: Error generation for evaluating data-cleaning algorithms

Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, Donatello Santoro

Research output: Contribution to journal (Article)

22 Scopus citations

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
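
To make the problem setting concrete, the following is a minimal sketch of constraint-aware error injection. It is not the BART algorithm described in the paper. It assumes a single functional dependency (here zip -> city), string-valued cells, and a simple greedy rule: change at most one right-hand-side cell per group of tuples sharing the same left-hand side, so that every injected error is detectable and no two changes interact. The names `violations` and `inject_errors`, the column names, and the sample rows are all hypothetical.

```python
import random
from collections import defaultdict


def violations(rows, lhs, rhs):
    """Return pairs of row indices that violate the dependency lhs -> rhs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)
    pairs = []
    for idxs in groups.values():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                if rows[idxs[a]][rhs] != rows[idxs[b]][rhs]:
                    pairs.append((idxs[a], idxs[b]))
    return pairs


def inject_errors(rows, lhs, rhs, n_errors, seed=0):
    """Greedily dirty up to n_errors rhs cells of a clean table.

    Assumption: the input satisfies lhs -> rhs. A cell is changed only if
    another tuple shares its lhs value (a witness), and each lhs group is
    touched at most once, so changes cannot mask each other.
    """
    rng = random.Random(seed)
    dirty = [dict(r) for r in rows]  # copy; the clean input is left untouched
    group_size = defaultdict(int)
    for r in rows:
        group_size[r[lhs]] += 1

    used_groups = set()
    changed = []
    order = list(range(len(dirty)))
    rng.shuffle(order)
    for i in order:
        if len(changed) == n_errors:
            break
        key = dirty[i][lhs]
        if group_size[key] < 2 or key in used_groups:
            continue  # no witness tuple, or group already dirtied
        dirty[i][rhs] = dirty[i][rhs] + "*"  # hypothetical typo-style error
        used_groups.add(key)
        changed.append(i)
    return dirty, changed


clean = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "94105", "city": "San Francisco"},
]
dirty, changed = inject_errors(clean, "zip", "city", n_errors=2)
detected = {i for pair in violations(dirty, "zip", "city") for i in pair}
```

In this sketch the greedy rule gives up completeness in a simple way: it may return fewer errors than requested (for example, the single "94105" tuple can never be dirtied detectably), but every change it does make is guaranteed to appear in a violation.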

Original language: English (US)
Pages (from-to): 36-47
Number of pages: 12
Journal: Unknown Journal
Volume: 9
Issue number: 2
State: Published - 2016
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Arocena, P. C., Glavic, B., Mecca, G., Miller, R. J., Papotti, P., & Santoro, D. (2016). Messing up with BART: Error generation for evaluating data-cleaning algorithms. Unknown Journal, 9(2), 36-47.