Effectively mining and using coverage and overlap statistics for data integration

Zaiqing Nie; Subbarao Kambhampati; Ullas Nambiar

doi:10.1109/TKDE.2005.76

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics, while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics over both controlled data sources and in the context of BibFinder with autonomous online sources.
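To make the two statistics concrete, the sketch below computes them for a single query from the answer sets returned by each source. The abstract does not define the terms, so the definitions here are assumptions: coverage is taken as the fraction of all known answers that a source returns, and overlap as the fraction that two sources both return. The function name and the source names `A` and `B` are hypothetical. This is only an illustration of the quantities, not the paper's learning method, which works over a hierarchical classification of queries with threshold-based mining.

```python
from itertools import combinations


def coverage_and_overlap(answers_by_source):
    """Compute per-source coverage and pairwise overlap for one query.

    answers_by_source maps a source name to the set of answer ids it returned.
    Assumed definitions (not taken from the abstract):
      coverage(S)     = |answers(S)| / |union of all answers|
      overlap(S1, S2) = |answers(S1) & answers(S2)| / |union of all answers|
    """
    all_answers = set().union(*answers_by_source.values())
    total = len(all_answers)
    if total == 0:
        # No source returned anything, so there is nothing to measure.
        return {}, {}

    coverage = {
        source: len(answers) / total
        for source, answers in answers_by_source.items()
    }
    overlap = {
        (s1, s2): len(answers_by_source[s1] & answers_by_source[s2]) / total
        for s1, s2 in combinations(sorted(answers_by_source), 2)
    }
    return coverage, overlap


# Hypothetical answer sets for one query against two sources.
example = {"A": {1, 2, 3}, "B": {3, 4}}
cov, ovl = coverage_and_overlap(example)
print(cov)
print(ovl)
```

Storing one such number for every query and every set of sources quickly becomes unmanageable, which is the storage and learning cost the abstract refers to.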

Original language	English (US)
Pages (from-to)	638-651
Number of pages	14
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	17
Issue number	5
DOIs	https://doi.org/10.1109/TKDE.2005.76
State	Published - May 2005

Keywords

Association rule mining
Coverage and overlap statistics
Query optimization for data integration

ASJC Scopus subject areas

Information Systems
Computer Science Applications
Computational Theory and Mathematics

Cite this

@article{40c0cfe5c30443089e4a325557782963,

title = "Effectively mining and using coverage and overlap statistics for data integration",

abstract = "Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics, while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics over both controlled data sources and in the context of BibFinder with autonomous online sources.",

keywords = "Association rule mining, Coverage and overlap statistics, Query optimization for data integration",

author = "Zaiqing Nie and Subbarao Kambhampati and Ullas Nambiar",

note = "Funding Information: This research was supported in part by US National Science Foundation grant IRI-9801676 and Arizona State University Prop. 301 grant ECR A601 (to ET-I3). Preliminary versions of this work have been presented at Proc. Third Int{\textquoteright}l Workshop Web Information and Data Management (WIDM) 2001 [22] and Proc. ACM Conf. Information and Knowledge Management (CIKM) 2002 [23].",

year = "2005",

month = may,

doi = "10.1109/TKDE.2005.76",

language = "English (US)",

volume = "17",

pages = "638--651",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "5",

}

TY - JOUR

T1 - Effectively mining and using coverage and overlap statistics for data integration

AU - Nie, Zaiqing

AU - Kambhampati, Subbarao

AU - Nambiar, Ullas

N1 - Funding Information: This research was supported in part by US National Science Foundation grant IRI-9801676 and Arizona State University Prop. 301 grant ECR A601 (to ET-I3). Preliminary versions of this work have been presented at Proc. Third Int’l Workshop Web Information and Data Management (WIDM) 2001 [22] and Proc. ACM Conf. Information and Knowledge Management (CIKM) 2002 [23].

PY - 2005/5

Y1 - 2005/5

AB - Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics, while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics over both controlled data sources and in the context of BibFinder with autonomous online sources.

KW - Association rule mining

KW - Coverage and overlap statistics

KW - Query optimization for data integration

UR - http://www.scopus.com/inward/record.url?scp=19944371158&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=19944371158&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2005.76

DO - 10.1109/TKDE.2005.76

M3 - Article

AN - SCOPUS:19944371158

SN - 1041-4347

VL - 17

SP - 638

EP - 651

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 5

ER -
