SourceRank: Relevance and trust assessment for deep web sources based on inter-source agreement

Raju Balakrishnan; Subbarao Kambhampati

doi:10.1145/1963405.1963440

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

33 Scopus citations

Abstract

One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. Existing database selection methods (both text and relational) assess source quality using query-similarity-based relevance. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. Both considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreement between these answers is likely to be helpful in assessing the importance and trustworthiness of the sources. We compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, on collections of up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.
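The random-walk step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `source_rank`, the `damping` smoothing factor, the uniform fallback for sources with no outgoing agreement, and the example weights are all assumptions made for illustration. The sketch only shows how a stationary visit probability might be computed by power iteration once adjusted agreement weights are available; it does not model how agreement or collusion is measured.

```python
def source_rank(agreement, damping=0.85, tol=1e-12, max_iter=10_000):
    """Approximate stationary visit probabilities of a random walk.

    agreement[i][j] is a non-negative edge weight from source i to source j
    (hypothetical adjusted-agreement values). damping is an assumed smoothing
    factor: with probability 1 - damping the walker jumps to a uniformly
    random source, which keeps the walk well defined on any graph.
    """
    n = len(agreement)

    # Row-normalize edge weights into transition probabilities.
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            # Assumption: a source with no outgoing agreement jumps uniformly.
            transition.append([1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        # Stop once the distribution changes by less than tol (L1 distance).
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical weights: sources 0 and 1 agree strongly with each other,
    # while little agreement flows toward source 2.
    weights = [
        [0.0, 0.9, 0.1],
        [0.9, 0.0, 0.1],
        [0.5, 0.5, 0.0],
    ]
    for idx, score in enumerate(source_rank(weights)):
        print(f"source {idx}: {score:.3f}")
```

In this toy graph the two mutually agreeing sources receive equal scores and the weakly supported source receives a lower one, which matches the intuition that agreement from other sources raises a source's score.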

Original language	English (US)
Title of host publication	Proceedings of the 20th International Conference on World Wide Web, WWW 2011
Pages	227-236
Number of pages	10
DOIs	https://doi.org/10.1145/1963405.1963440
State	Published - 2011
Event	20th International Conference on World Wide Web, WWW 2011 - Hyderabad, India; Duration: Mar 28 2011 → Apr 1 2011

Publication series

Name	Proceedings of the 20th International Conference on World Wide Web, WWW 2011

Other

Other	20th International Conference on World Wide Web, WWW 2011
Country/Territory	India
City	Hyderabad
Period	3/28/11 → 4/1/11

ASJC Scopus subject areas

Computer Networks and Communications

Cite this

SourceRank: Relevance and trust assessment for deep web sources based on inter-source agreement. / Balakrishnan, Raju; Kambhampati, Subbarao.
Proceedings of the 20th International Conference on World Wide Web, WWW 2011. 2011. p. 227-236 (Proceedings of the 20th International Conference on World Wide Web, WWW 2011).

Balakrishnan, R & Kambhampati, S 2011, SourceRank: Relevance and trust assessment for deep web sources based on inter-source agreement. in Proceedings of the 20th International Conference on World Wide Web, WWW 2011. Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 227-236, 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, 3/28/11. https://doi.org/10.1145/1963405.1963440

@inproceedings{573ba504392444d4877030710c74011c,

title = "{SourceRank}: Relevance and trust assessment for deep web sources based on inter-source agreement",

abstract = "One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. Existing database selection methods (both text and relational) assess source quality using query-similarity-based relevance. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. Both considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreement between these answers is likely to be helpful in assessing the importance and trustworthiness of the sources. We compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, on collections of up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22--60\% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.",

author = "Raju Balakrishnan and Subbarao Kambhampati",

year = "2011",

doi = "10.1145/1963405.1963440",

language = "English (US)",

isbn = "9781450306324",

series = "Proceedings of the 20th International Conference on World Wide Web, WWW 2011",

pages = "227--236",

booktitle = "Proceedings of the 20th International Conference on World Wide Web, WWW 2011",

note = "20th International Conference on World Wide Web, WWW 2011 ; Conference date: 28-03-2011 Through 01-04-2011",

}

TY - GEN

T1 - SourceRank

T2 - 20th International Conference on World Wide Web, WWW 2011

AU - Balakrishnan, Raju

AU - Kambhampati, Subbarao

PY - 2011

Y1 - 2011

N2 - One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. Existing database selection methods (both text and relational) assess source quality using query-similarity-based relevance. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. Both considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreement between these answers is likely to be helpful in assessing the importance and trustworthiness of the sources. We compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, on collections of up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.

AB - One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. Existing database selection methods (both text and relational) assess source quality using query-similarity-based relevance. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the sources. Second, query-based relevance does not consider the importance of the results. Both considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, we conjecture that the agreement between these answers is likely to be helpful in assessing the importance and trustworthiness of the sources. We compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. We evaluate SourceRank in multiple domains, including sources in Google Base, on collections of up to 675 sources. We demonstrate that SourceRank tracks source corruption. Further, our relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and other baseline methods. SourceRank has been implemented in a system called Factal.

UR - http://www.scopus.com/inward/record.url?scp=84867149755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867149755&partnerID=8YFLogxK

U2 - 10.1145/1963405.1963440

DO - 10.1145/1963405.1963440

M3 - Conference contribution

AN - SCOPUS:84867149755

SN - 9781450306324

T3 - Proceedings of the 20th International Conference on World Wide Web, WWW 2011

SP - 227

EP - 236

BT - Proceedings of the 20th International Conference on World Wide Web, WWW 2011

Y2 - 28 March 2011 through 1 April 2011

ER -
