TY - GEN
T1 - SourceRank
T2 - 20th International Conference on World Wide Web, WWW 2011
AU - Balakrishnan, Raju
AU - Kambhampati, Subbarao
PY - 2011
Y1 - 2011
N2 - One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. Existing database selection methods (both text and relational) assess source quality using query-similarity-based relevance. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. Both considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreement between these answers is likely to be helpful in assessing the importance and trustworthiness of the sources. We compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, with sizes up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.
AB - One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. Existing database selection methods (both text and relational) assess source quality using query-similarity-based relevance. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. Both considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreement between these answers is likely to be helpful in assessing the importance and trustworthiness of the sources. We compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, with sizes up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.
UR - http://www.scopus.com/inward/record.url?scp=84867149755&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867149755&partnerID=8YFLogxK
U2 - 10.1145/1963405.1963440
DO - 10.1145/1963405.1963440
M3 - Conference contribution
AN - SCOPUS:84867149755
SN - 9781450306324
T3 - Proceedings of the 20th International Conference on World Wide Web, WWW 2011
SP - 227
EP - 236
BT - Proceedings of the 20th International Conference on World Wide Web, WWW 2011
Y2 - 28 March 2011 through 1 April 2011
ER -