TY - GEN
T1 - Mining Source Coverage Statistics for Data Integration
AU - Nie, Zaiqing
AU - Kambhampati, Subbarao
AU - Nambiar, Ullas
AU - Vaddi, Sreelakshmi
N1 - Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2001
Y1 - 2001
N2 - Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. Naive approaches can become infeasible very quickly. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries, and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present preliminary experimental results showing the feasibility of the approach.
AB - Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. Naive approaches can become infeasible very quickly. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries, and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present preliminary experimental results showing the feasibility of the approach.
UR - http://www.scopus.com/inward/record.url?scp=0141953858&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0141953858&partnerID=8YFLogxK
U2 - 10.1145/502933.502934
DO - 10.1145/502933.502934
M3 - Conference contribution
AN - SCOPUS:0141953858
SN - 1581134444
SN - 9781581134445
T3 - Proceedings of the Third International Workshop on Web Information and Data Management (WIDM)
SP - 1
EP - 8
BT - Proceedings of the Third International Workshop on Web Information and Data Management (WIDM)
A2 - Lim, E.-P.
A2 - Hsiang-Li, R.C.
PB - Association for Computing Machinery (ACM)
T2 - Proceedings of the Third International Workshop on Web Information and Data Management (WIDM)
Y2 - 9 November 2001
ER -