A machine learning based approach for table detection on the web

Yalin Wang, Jianying Hu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

148 Scopus citations

Abstract

Tables are a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as the content characteristics of tables are studied. To facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, of which 1,740 are genuine tables. Experiments were conducted using the cross validation method, and an F-measure of 95.89% was achieved.
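
As a rough illustration of the evaluation metric mentioned in the abstract, the sketch below computes precision, recall, and the F-measure from raw counts. It assumes the balanced F-measure (the harmonic mean of precision and recall); the abstract does not say which variant was used. The function name `f_measure` and the counts passed to it are hypothetical and are not results from the paper. Only the two dataset sizes are taken from the abstract.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, F) for the positive class (genuine tables).

    Assumes the balanced F-measure: F = 2 * P * R / (P + R).
    """
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Dataset sizes quoted in the abstract.
TOTAL_LEAF_TABLES = 11477
GENUINE_TABLES = 1740

# Share of leaf TABLE elements that are genuine tables (the positive class).
genuine_share = GENUINE_TABLES / TOTAL_LEAF_TABLES

# Hypothetical confusion counts for illustration only, NOT from the paper.
precision, recall, f = f_measure(true_pos=90, false_pos=10, false_neg=30)

print(f"genuine share: {genuine_share:.2%}")
print(f"precision={precision:.3f} recall={recall:.3f} F={f:.3f}")
```

Because only about 15% of the leaf TABLE elements are genuine, plain accuracy would be dominated by the non-genuine class, which is one reason a metric such as the F-measure on the genuine class is informative here.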

Original language: English (US)
Title of host publication: Proceedings of the 11th International Conference on World Wide Web, WWW '02
Pages: 242-250
Number of pages: 9
State: Published - 2002
Externally published: Yes
Event: 11th International Conference on World Wide Web, WWW '02 - Honolulu, HI, United States
Duration: May 7 2002 to May 11 2002

Publication series

Name: Proceedings of the 11th International Conference on World Wide Web, WWW '02

Other

Other: 11th International Conference on World Wide Web, WWW '02
Country/Territory: United States
City: Honolulu, HI
Period: 5/7/02 to 5/11/02

Keywords

  • Decision tree
  • Information retrieval
  • Layout analysis
  • Machine learning
  • Support vector machine
  • Table detection

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
