Information extraction from Web pages using presentation regularities and domain knowledge

Srinivas Vadrevu, Fatih Gelgi, Hasan Davulcu

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.

Original language: English (US)
Pages (from-to): 157-179
Number of pages: 23
Journal: World Wide Web
Volume: 10
Issue number: 2
State: Published - Jun 2007

Keywords

  • Domain knowledge
  • Grammar induction
  • Information extraction
  • Metadata
  • Page segmentation
  • Pattern mining
  • Semantic partitioner
  • Statistical domain model
  • Web

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
