Information extraction from Web pages using presentation regularities and domain knowledge

Srinivas Vadrevu; Fatih Gelgi; Hasan Davulcu

doi:10.1007/s11280-007-0021-1

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that can transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.

Original language	English (US)
Pages (from-to)	157-179
Number of pages	23
Journal	World Wide Web
Volume	10
Issue number	2
DOIs	https://doi.org/10.1007/s11280-007-0021-1
State	Published - Jun 2007

Keywords

Domain knowledge
Grammar induction
Information extraction
Metadata
Page segmentation
Pattern mining
Semantic partitioner
Statistical domain model
Web

ASJC Scopus subject areas

Software
Hardware and Architecture
Computer Networks and Communications

Cite this

@article{05b9660c368a4c3dbc08dfa01b21833c,
  title = "Information extraction from Web pages using presentation regularities and domain knowledge",
  abstract = "World Wide Web is transforming itself into the largest information resource making the process of information extraction (IE) from Web an important and challenging problem. In this paper, we present an automated IE system that is domain independent and that can automatically transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in terms of a statistical domain model. We demonstrate that such system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.",
  keywords = "Domain knowledge, Grammar induction, Information extraction, Metadata, Page segmentation, Pattern mining, Semantic partitioner, Statistical domain model, Web",
  author = "Srinivas Vadrevu and Fatih Gelgi and Hasan Davulcu",
  note = "Funding Information: Acknowledgements This work was partially supported by the Office of Naval Research (ONR) under its Multidisciplinary Research Program of the University Research Initiative (MURI) under Grant No. N00014-04-1-0723.",
  year = "2007",
  month = jun,
  doi = "10.1007/s11280-007-0021-1",
  language = "English (US)",
  volume = "10",
  pages = "157--179",
  journal = "World Wide Web",
  issn = "1386-145X",
  publisher = "Springer New York",
  number = "2",
}

TY - JOUR
T1 - Information extraction from Web pages using presentation regularities and domain knowledge
AU - Vadrevu, Srinivas
AU - Gelgi, Fatih
AU - Davulcu, Hasan
N1 - Funding Information: Acknowledgements This work was partially supported by the Office of Naval Research (ONR) under its Multidisciplinary Research Program of the University Research Initiative (MURI) under Grant No. N00014-04-1-0723.
PY - 2007/6
Y1 - 2007/6
N2 - World Wide Web is transforming itself into the largest information resource making the process of information extraction (IE) from Web an important and challenging problem. In this paper, we present an automated IE system that is domain independent and that can automatically transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in terms of a statistical domain model. We demonstrate that such system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
AB - World Wide Web is transforming itself into the largest information resource making the process of information extraction (IE) from Web an important and challenging problem. In this paper, we present an automated IE system that is domain independent and that can automatically transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in terms of a statistical domain model. We demonstrate that such system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
KW - Domain knowledge
KW - Grammar induction
KW - Information extraction
KW - Metadata
KW - Page segmentation
KW - Pattern mining
KW - Semantic partitioner
KW - Statistical domain model
KW - Web
UR - http://www.scopus.com/inward/record.url?scp=34247895472&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34247895472&partnerID=8YFLogxK
U2 - 10.1007/s11280-007-0021-1
DO - 10.1007/s11280-007-0021-1
M3 - Article
AN - SCOPUS:34247895472
SN - 1386-145X
VL - 10
SP - 157
EP - 179
JO - World Wide Web
JF - World Wide Web
IS - 2
ER -
