Estimating data integration and cleaning effort

Sebastian Kruse, Paolo Papotti, Felix Naumann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Scopus citations

Abstract

Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some, and in most cases significant, human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists) it would therefore be of great value to be able to estimate the effort of cleaning and integrating some given data sets and to know the pitfalls of such an integration project in advance. Such an estimate helps in deciding on an integration project through cost/benefit analysis, in budgeting a team with funds and manpower, and in monitoring its progress. Further, knowledge of how well a data source fits into a given data ecosystem improves source selection. We present an extensible framework for automatic effort estimation for mapping and cleaning activities in data integration projects with multiple sources. It comprises a set of measures and methods for estimating integration complexity and ultimately effort, taking into account heterogeneities of both schemas and instances and regarding both integration and cleaning operations. Experiments on two real-world scenarios show that our proposal is two to four times more accurate than a current approach in estimating the time duration of an integration process, and provides a meaningful breakdown of the integration problems as well as the required integration activities.

Original language: English (US)
Title of host publication: EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings
Publisher: OpenProceedings.org, University of Konstanz, University Library
Pages: 61-72
Number of pages: 12
ISBN (Electronic): 9783893180677
State: Published - 2015
Externally published: Yes
Event: 18th International Conference on Extending Database Technology, EDBT 2015 - Brussels, Belgium
Duration: Mar 23 2015 → Mar 27 2015

ASJC Scopus subject areas

  • Information Systems
  • Software
