Estimating data integration and cleaning effort

Sebastian Kruse, Paolo Papotti, Felix Naumann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some, and in most cases significant, human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists), it would therefore be of great value to be able to estimate the effort of cleaning and integrating given data sets and to know the pitfalls of such an integration project in advance. This helps in deciding on an integration project using cost/benefit analysis, budgeting a team with funds and manpower, and monitoring its progress. Further, knowledge of how well a data source fits into a given data ecosystem improves source selection. We present an extensible framework for automatically estimating the effort of mapping and cleaning activities in data integration projects with multiple sources. It comprises a set of measures and methods for estimating integration complexity and ultimately effort, taking into account heterogeneities of both schemas and instances and covering both integration and cleaning operations. Experiments on two real-world scenarios show that our proposal is two to four times more accurate than a current approach in estimating the duration of an integration process, and that it provides a meaningful breakdown of the integration problems as well as the required integration activities.

Original language: English (US)
Title of host publication: EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings
Publisher: OpenProceedings.org, University of Konstanz, University Library
Pages: 61-72
Number of pages: 12
ISBN (Electronic): 9783893180677
DOI: https://doi.org/10.5441/002/edbt.2015.07
State: Published - 2015
Externally published: Yes
Event: 18th International Conference on Extending Database Technology, EDBT 2015 - Brussels, Belgium
Duration: Mar 23 2015 → Mar 27 2015

Other

Other: 18th International Conference on Extending Database Technology, EDBT 2015
Country: Belgium
City: Brussels
Period: 3/23/15 → 3/27/15

Fingerprint

Data integration
Cleaning
Budget control
Cost benefit analysis
Ecosystems
Managers
Monitoring
Processing
Experiments

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Kruse, S., Papotti, P., & Naumann, F. (2015). Estimating data integration and cleaning effort. In EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings (pp. 61-72). OpenProceedings.org, University of Konstanz, University Library. https://doi.org/10.5441/002/edbt.2015.07

@inproceedings{63b0734f8a674ed9a6b6145f0343d53a,
title = "Estimating data integration and cleaning effort",
author = "Sebastian Kruse and Paolo Papotti and Felix Naumann",
year = "2015",
doi = "10.5441/002/edbt.2015.07",
language = "English (US)",
pages = "61--72",
booktitle = "EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings",
publisher = "OpenProceedings.org, University of Konstanz, University Library",

}

TY - GEN

T1 - Estimating data integration and cleaning effort

AU - Kruse, Sebastian

AU - Papotti, Paolo

AU - Naumann, Felix

PY - 2015

Y1 - 2015

UR - http://www.scopus.com/inward/record.url?scp=84976312779&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84976312779&partnerID=8YFLogxK

U2 - 10.5441/002/edbt.2015.07

DO - 10.5441/002/edbt.2015.07

M3 - Conference contribution

SP - 61

EP - 72

BT - EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings

PB - OpenProceedings.org, University of Konstanz, University Library

ER -