Effectively mining and using coverage and overlap statistics for data integration

Zaiqing Nie, Subbarao Kambhampati, Ullas Nambiar

Research output: Contribution to journal (Article)

12 Scopus citations

Abstract

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics, while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics over both controlled data sources and in the context of BibFinder with autonomous online sources.
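
As an illustration of the quantities the abstract refers to, the following minimal sketch computes coverage and pairwise overlap for a single query over toy data. It assumes coverage means the fraction of a query's known answers that one source returns, and overlap means the fraction that two sources both return. The source names, answer sets, and the `MIN_STAT` cutoff are hypothetical, and the simple filter at the end only stands in for the idea of keeping the number of stored statistics small; it is not the paper's mining algorithm.

```python
from itertools import combinations


def coverage(source_answers, all_answers):
    """Fraction of the known answers to a query that one source returns."""
    if not all_answers:
        return 0.0
    return len(source_answers & all_answers) / len(all_answers)


def overlap(answers_a, answers_b, all_answers):
    """Fraction of the known answers that both sources return."""
    if not all_answers:
        return 0.0
    return len(answers_a & answers_b & all_answers) / len(all_answers)


# Hypothetical answer sets returned by three sources for one query.
answers = {
    "source_a": {"p1", "p2", "p3"},
    "source_b": {"p2", "p3", "p4"},
    "source_c": {"p5"},
}

# All distinct answers seen across the sources for this query.
universe = set().union(*answers.values())

coverages = {name: coverage(found, universe) for name, found in answers.items()}
overlaps = {
    (a, b): overlap(answers[a], answers[b], universe)
    for a, b in combinations(sorted(answers), 2)
}

# Assumed cutoff: statistics below it are simply not stored.
MIN_STAT = 0.1
kept_overlaps = {pair: value for pair, value in overlaps.items() if value >= MIN_STAT}

print(coverages)
print(kept_overlaps)
```

A query planner could use such numbers to call the highest-coverage source first and to skip a second source whose answers largely overlap with those already retrieved.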

Original language: English (US)
Pages (from-to): 638-651
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17
Issue number: 5
State: Published - May 2005

Keywords

  • Association rule mining
  • Coverage and overlap statistics
  • Query optimization for data integration

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
