Mining Source Coverage Statistics for Data Integration

Zaiqing Nie, Subbarao Kambhampati, Ullas Nambiar, Sreelakshmi Vaddi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Scopus citations

Abstract

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to keep the storage and learning costs manageable. Naive approaches can become infeasible very quickly. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the number of needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method and present preliminary experimental results showing the feasibility of the approach.

Original language: English (US)
Title of host publication: Proceedings of the Third International Workshop on Web Information and Data Management (WIDM)
Editors: E.-P. Lim, R.C. Hsiang-Li
Publisher: Association for Computing Machinery (ACM)
Pages: 1-8
Number of pages: 8
ISBN (Print): 1581134444, 9781581134445
State: Published - 2001
Event: Proceedings of the Third International Workshop on Web Information and Data Management (WIDM) - Atlanta, GA, United States
Duration: Nov 9 2001 → Nov 9 2001

Publication series

Name: Proceedings of the Third International Workshop on Web Information and Data Management (WIDM)

Conference

Conference: Proceedings of the Third International Workshop on Web Information and Data Management (WIDM)
Country/Territory: United States
City: Atlanta, GA
Period: 11/9/01 → 11/9/01

ASJC Scopus subject areas

  • Engineering (all)
