Detecting tables in HTML documents

Yalin Wang, Jianying Hu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

43 Scopus citations

Abstract

A table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to any domain. Various features reflecting the layout as well as the content characteristics of tables are explored. The system is tested on a large database consisting of 1,393 HTML files collected from hundreds of different websites in various domains and containing over 10,000 leaf <table> elements. Experiments were conducted using the cross-validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
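The abstract mentions leaf <table> elements and an F-measure without defining either. The sketch below illustrates both under stated assumptions: a leaf table is taken to be a <table> that contains no nested <table>, and the F-measure is taken to be the balanced harmonic mean of precision and recall. The names `LeafTableCounter`, `count_leaf_tables`, and `f_measure` are hypothetical and are not taken from the paper; this is not the authors' implementation.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Counts <table> elements that contain no nested <table> (assumed definition of 'leaf')."""

    def __init__(self):
        super().__init__()
        # One flag per currently open <table>; set to True once a nested table is seen.
        self._has_nested = []
        self.leaf_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._has_nested:
                self._has_nested[-1] = True  # the enclosing table is not a leaf
            self._has_nested.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self._has_nested:
            if not self._has_nested.pop():
                self.leaf_count += 1


def count_leaf_tables(html):
    """Return the number of leaf <table> elements in an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.leaf_count


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed form)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A layout table that wraps two inner tables would contribute two leaf tables and no more, since the wrapper itself contains nested tables.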

Original language: English (US)
Title of host publication: Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings
Editors: Daniel Lopresti, Jianying Hu, Ramanujan Kashi
Publisher: Springer Verlag
Pages: 249-260
Number of pages: 12
ISBN (Print): 3540440682, 9783540440680
DOIs
State: Published - 2002
Externally published: Yes
Event: 5th International Workshop on Document Analysis Systems, DAS 2002 - Princeton, United States
Duration: Aug 19 2002 to Aug 21 2002

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2423
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 5th International Workshop on Document Analysis Systems, DAS 2002
Country/Territory: United States
City: Princeton
Period: 8/19/02 to 8/21/02

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
