A machine learning based approach for table detection on the web

Yalin Wang; Jianying Hu

doi:10.1145/511446.511478

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

148 Scopus citations

Abstract

Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
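
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch and not the authors' code. It assumes the standard balanced F-measure (the harmonic mean of precision and recall). The abstract does not state which variant was used, and it does not report precision or recall, so the values passed to `f_measure` below are hypothetical placeholders chosen only to exercise the formula. The two table counts are taken from the abstract.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: this is the standard definition; the abstract does not
    say which F-measure variant the authors used.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts reported in the abstract.
LEAF_TABLES = 11_477
GENUINE_TABLES = 1_740

non_genuine_tables = LEAF_TABLES - GENUINE_TABLES
genuine_share = GENUINE_TABLES / LEAF_TABLES

# Hypothetical precision/recall, NOT from the paper; illustration only.
example_f = f_measure(precision=0.5, recall=1.0)

print(f"non-genuine leaf tables: {non_genuine_tables}")
print(f"share of genuine tables: {genuine_share:.2%}")
print(f"F-measure for P=0.5, R=1.0: {example_f:.4f}")
```

Genuine tables make up roughly 15% of the leaf TABLE elements, so the two classes are unbalanced. A combined precision/recall measure is a common choice in that situation, although the abstract itself does not give a reason for the metric.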

Original language	English (US)
Title of host publication	Proceedings of the 11th International Conference on World Wide Web, WWW '02
Pages	242-250
Number of pages	9
DOIs	https://doi.org/10.1145/511446.511478
State	Published - 2002
Externally published	Yes
Event	11th International Conference on World Wide Web, WWW '02 - Honolulu, HI, United States; Duration: May 7 2002 → May 11 2002

Publication series

Name	Proceedings of the 11th International Conference on World Wide Web, WWW '02

Other

Other	11th International Conference on World Wide Web, WWW '02
Country/Territory	United States
City	Honolulu, HI
Period	5/7/02 → 5/11/02

Keywords

Decision tree
Information retrieval
Layout analysis
Machine learning
Support vector machine
Table detection

ASJC Scopus subject areas

Computer Networks and Communications
Computer Science Applications

Access to Document

10.1145/511446.511478

Cite this

Wang, Y & Hu, J 2002, A machine learning based approach for table detection on the web. in Proceedings of the 11th International Conference on World Wide Web, WWW '02. Proceedings of the 11th International Conference on World Wide Web, WWW '02, pp. 242-250, 11th International Conference on World Wide Web, WWW '02, Honolulu, HI, United States, 5/7/02. https://doi.org/10.1145/511446.511478

@inproceedings{5fa300cd61c84b149d85697c46b78566,

title = "A machine learning based approach for table detection on the web",

abstract = "Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.",

keywords = "Decision tree, Information retrieval, Layout analysis, Machine learning, Support vector machine, Table detection",

author = "Yalin Wang and Jianying Hu",

year = "2002",

doi = "10.1145/511446.511478",

language = "English (US)",

isbn = "1581134495",

series = "Proceedings of the 11th International Conference on World Wide Web, WWW '02",

pages = "242--250",

booktitle = "Proceedings of the 11th International Conference on World Wide Web, WWW '02",

note = "11th International Conference on World Wide Web, WWW '02 ; Conference date: 07-05-2002 Through 11-05-2002",

}

TY - GEN

T1 - A machine learning based approach for table detection on the web

AU - Wang, Yalin

AU - Hu, Jianying

PY - 2002

Y1 - 2002

N2 - Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.

AB - Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.

KW - Decision tree

KW - Information retrieval

KW - Layout analysis

KW - Machine learning

KW - Support vector machine

KW - Table detection

UR - http://www.scopus.com/inward/record.url?scp=33845353207&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33845353207&partnerID=8YFLogxK

U2 - 10.1145/511446.511478

DO - 10.1145/511446.511478

M3 - Conference contribution

AN - SCOPUS:33845353207

SN - 1581134495

SN - 9781581134490

T3 - Proceedings of the 11th International Conference on World Wide Web, WWW '02

SP - 242

EP - 250

BT - Proceedings of the 11th International Conference on World Wide Web, WWW '02

T2 - 11th International Conference on World Wide Web, WWW '02

Y2 - 7 May 2002 through 11 May 2002

ER -
