SourceRank: Relevance and trust assessment for deep web sources based on inter-source agreement

Raju Balakrishnan, Subbarao Kambhampati

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

33 Scopus citations

Abstract

One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. The existing database selection methods (both text and relational) assess source quality based on query-similarity-based relevance assessment. When applied to the deep web, these methods have two deficiencies. First, the methods are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. These two considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreements between these answers are likely to be helpful in assessing the importance and the trustworthiness of the sources. We compute the agreement between the sources as the agreement of the answers returned. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with sources at the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, with sizes up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.
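
The random-walk step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how agreement weights are computed, how collusion is compensated, or whether the walk is smoothed, so everything below is an assumption made for illustration: the function name `source_rank`, the `damping` parameter (a PageRank-style uniform jump added so the walk has a unique stationary distribution), the edge direction, and the toy agreement weights. It is not the authors' implementation.

```python
def source_rank(agreement, damping=0.85, tol=1e-12, max_iter=1000):
    """Approximate the stationary visit probabilities of a random walk.

    agreement[i][j] is a non-negative weight on the edge from source i to
    source j (hypothetical convention: how strongly i's answers are
    corroborated by j). The damping factor is an assumption, not something
    stated in the abstract.
    """
    n = len(agreement)

    # Row-normalize the weights into transition probabilities.
    # A source with no outgoing agreement jumps uniformly at random.
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)

    # Power iteration from the uniform distribution.
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


# Toy data (invented): sources 0 and 1 largely agree with each other,
# while source 2 receives little agreement from either.
toy_agreement = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.2, 0.2, 0.0],
]
scores = source_rank(toy_agreement)
```

In this toy graph the two mutually agreeing sources end up with equal scores that are higher than the score of the weakly corroborated third source, which is the qualitative behavior the abstract attributes to agreement-based ranking.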

Original language: English (US)
Title of host publication: Proceedings of the 20th International Conference on World Wide Web, WWW 2011
Pages: 227-236
Number of pages: 10
State: Published - 2011
Event: 20th International Conference on World Wide Web, WWW 2011 - Hyderabad, India
Duration: Mar 28 2011 - Apr 1 2011

Publication series

Name: Proceedings of the 20th International Conference on World Wide Web, WWW 2011

Other

Other: 20th International Conference on World Wide Web, WWW 2011
Country/Territory: India
City: Hyderabad
Period: 3/28/11 - 4/1/11

ASJC Scopus subject areas

  • Computer Networks and Communications
