TY - JOUR
T1 - Optimizing Recursive Information Gathering Plans in EMERAC
AU - Kambhampati, Subbarao
AU - Lambrecht, Eric
AU - Nambiar, Ullas
AU - Nie, Zaiqing
AU - Gnanaprakasam, Senthil
N1 - Funding Information:
This research is supported in part by NSF young investigator award (NYI) IRI-9457634, Army AASERT grant DAAH04-96-1-0247, and NSF grant IRI-9801676. We thank Selçuk Candan for many helpful comments. Preliminary versions of parts of this work have been presented at IJCAI (Lambrecht et al., 1999), and workshops on Intelligent Information Integration (Kambhampati and Gnanaprakasam, 1999; Lambrecht and Kambhampati, 1998). †Author to whom all correspondence should be addressed.
PY - 2004/3
Y1 - 2004/3
N2 - In this paper we describe two optimization techniques that are specially tailored for information gathering. The first is a greedy minimization algorithm that minimizes an information gathering plan by removing redundant and overlapping information sources without loss of completeness. We then discuss a set of heuristics that guide the greedy minimization algorithm so as to remove costlier information sources first. In contrast to previous work, our approach can handle recursive query plans that arise commonly in the presence of constrained sources. Second, we present a method for ordering the access to sources to reduce the execution cost. This problem differs significantly from the traditional database query optimization problem as sources on the Internet have a variety of access limitations and the execution cost in information gathering is affected both by network traffic and by the connection setup costs. Furthermore, because of the autonomous and decentralized nature of the Web, very little cost statistics about the sources may be available. In this paper, we propose a heuristic algorithm for ordering source calls that takes these constraints into account. Specifically, our algorithm takes both access costs and traffic costs into account, and is able to operate with very coarse statistics about sources (i.e., without depending on full source statistics). Finally, we will discuss implementation and empirical evaluation of these methods in Emerac, our prototype information gathering system.
KW - Data integration
KW - Information gathering
KW - Query optimization
KW - Web and databases
UR - http://www.scopus.com/inward/record.url?scp=1442284482&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=1442284482&partnerID=8YFLogxK
U2 - 10.1023/B:JIIS.0000012467.66268.9e
DO - 10.1023/B:JIIS.0000012467.66268.9e
M3 - Article
AN - SCOPUS:1442284482
SN - 0925-9902
VL - 22
SP - 119
EP - 153
JO - Journal of Intelligent Information Systems
JF - Journal of Intelligent Information Systems
IS - 2
ER -