Temporal rules discovery for web data cleaning

Ziawasch Abedjan, Cuneyt G. Akcora, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

29 Scopus citations

Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.
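
To make the notion of a rule with a duration concrete, the following is a minimal sketch of how such a rule might be checked once it is known. It is not the discovery method of the paper: the `Fact` type, the `find_violations` helper, the six-hour window, and the sample facts are all hypothetical and chosen only for illustration. The rule modelled here is "an entity has a single value within any time window of a given length", so two facts about the same entity that report different values less than one window apart are flagged as a conflict.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class Fact(NamedTuple):
    """A hypothetical extracted fact: an entity reported with a value at a time."""
    entity: str
    value: str
    time: datetime


def find_violations(facts, window):
    """Return pairs of facts about the same entity that report
    different values less than `window` apart."""
    by_entity = defaultdict(list)
    for fact in facts:
        by_entity[fact.entity].append(fact)

    violations = []
    for group in by_entity.values():
        group.sort(key=lambda f: f.time)
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if second.time - first.time >= window:
                    break  # facts are sorted, so later ones are even further apart
                if first.value != second.value:
                    violations.append((first, second))
    return violations


# Made-up sample data, not taken from the paper's dataset.
facts = [
    Fact("Alice", "Paris", datetime(2016, 9, 5, 9, 0)),
    Fact("Alice", "Delhi", datetime(2016, 9, 5, 10, 0)),
    Fact("Alice", "Delhi", datetime(2016, 9, 7, 9, 0)),
    Fact("Bob", "Doha", datetime(2016, 9, 5, 9, 0)),
]
violations = find_violations(facts, timedelta(hours=6))
for first, second in violations:
    print(first.entity, first.value, "conflicts with", second.value)
```

Without the window, every change of location would count as an error; with it, only reports that are too close in time to both be valid are flagged. Choosing the window length from noisy data is the discovery problem the abstract describes, and this sketch simply assumes the length is given.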

Original language: English (US)
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 336-347
Number of pages: 12
Volume: 9
Edition: 4
State: Published - 2016
Externally published: Yes
Event: 42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India
Duration: Sep 5 2016 to Sep 9 2016

Other

Other: 42nd International Conference on Very Large Data Bases, VLDB 2016
Country: India
City: Delhi
Period: 9/5/16 to 9/9/16

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)
