Retrieving and organizing web pages by "information unit"

Wen Syan Li; Kasim Candan; Quoc Vu; Divyakant Agrawal

doi:10.1145/371920.372057

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

82 Scopus citations

Abstract

Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.

Original language	English (US)
Title of host publication	Proceedings of the 10th International Conference on World Wide Web, WWW 2001
Publisher	Association for Computing Machinery, Inc
Pages	230-244
Number of pages	15
ISBN (Print)	1581133480, 9781581133486
DOIs	https://doi.org/10.1145/371920.372057
State	Published - Apr 1 2001
Externally published	Yes
Event	10th International Conference on World Wide Web, WWW 2001 - Hong Kong, Hong Kong; Duration: May 1 2001 → May 5 2001

Other

Other	10th International Conference on World Wide Web, WWW 2001
Country/Territory	Hong Kong
City	Hong Kong
Period	5/1/01 → 5/5/01

Keywords

Link structures
Progressive processing
Query relaxation
Web proximity search

ASJC Scopus subject areas

Computer Networks and Communications

Access to Document

10.1145/371920.372057

Cite this

@inproceedings{e26a32e1622c464b944185ba094e0b48,

title = "Retrieving and organizing web pages by {"}information unit{"}",

abstract = "Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.",

keywords = "Link structures, Progressive processing, Query relaxation, Web proximity search",

author = "Li, {Wen Syan} and Kasim Candan and Quoc Vu and Divyakant Agrawal",

year = "2001",

month = apr,

day = "1",

doi = "10.1145/371920.372057",

language = "English (US)",

isbn = "1581133480",

pages = "230--244",

booktitle = "Proceedings of the 10th International Conference on World Wide Web, WWW 2001",

publisher = "Association for Computing Machinery, Inc",

note = "10th International Conference on World Wide Web, WWW 2001 ; Conference date: 01-05-2001 Through 05-05-2001",

}

TY - GEN

T1 - Retrieving and organizing web pages by "information unit"

AU - Li, Wen Syan

AU - Candan, Kasim

AU - Vu, Quoc

AU - Agrawal, Divyakant

PY - 2001/4/1

Y1 - 2001/4/1

N2 - Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.

AB - Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.

KW - Link structures

KW - Progressive processing

KW - Query relaxation

KW - Web proximity search

UR - http://www.scopus.com/inward/record.url?scp=85013624238&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013624238&partnerID=8YFLogxK

U2 - 10.1145/371920.372057

DO - 10.1145/371920.372057

M3 - Conference contribution

AN - SCOPUS:85013624238

SN - 1581133480

SN - 9781581133486

SP - 230

EP - 244

BT - Proceedings of the 10th International Conference on World Wide Web, WWW 2001

PB - Association for Computing Machinery, Inc

T2 - 10th International Conference on World Wide Web, WWW 2001

Y2 - 1 May 2001 through 5 May 2001

ER -
