Detecting tables in HTML documents

Yalin Wang; Jianying Hu

doi:10.1007/3-540-45869-7_29

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

43 Scopus citations

Abstract

Table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although in HTML documents tables are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of the genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to include any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different websites from various domains and contains over 10,000 leaf <table> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
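
Two terms in the abstract can be made concrete with a short sketch. The Python below is illustrative only and is not the authors' system: it assumes that a "leaf" <table> element is one containing no nested <table> (the abstract does not define the term), and that the reported F-measure is the balanced harmonic mean of precision and recall. The names LeafTableCounter, count_leaf_tables, and f_measure are hypothetical.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Counts <table> elements that contain no nested <table> (assumed meaning of "leaf")."""

    def __init__(self):
        super().__init__()
        # One flag per currently open <table>; set to True once a nested <table> is seen.
        self._has_child = []
        self.total = 0
        self.leaves = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> is handled too.
        if tag == "table":
            if self._has_child:
                self._has_child[-1] = True  # the enclosing table is not a leaf
            self._has_child.append(False)
            self.total += 1

    def handle_endtag(self, tag):
        # Tables that are never closed are counted in total but not as leaves.
        if tag == "table" and self._has_child:
            if not self._has_child.pop():
                self.leaves += 1


def count_leaf_tables(html):
    """Return (leaf_count, total_count) of <table> elements in an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.leaves, parser.total


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    page = (
        "<table><tr><td>"
        "<table><tr><td>inner</td></tr></table>"
        "</td></tr></table>"
        "<table><tr><td>standalone</td></tr></table>"
    )
    print(count_leaf_tables(page))
    print(f_measure(0.5, 1.0))
```

Deciding whether each leaf element is a genuine relational table is the classification step the paper addresses; this sketch only enumerates the candidates and scores a hypothetical classifier's output.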

Original language	English (US)
Title of host publication	Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings
Editors	Daniel Lopresti, Jianying Hu, Ramanujan Kashi
Publisher	Springer Verlag
Pages	249-260
Number of pages	12
ISBN (Print)	3540440682, 9783540440680
DOIs	https://doi.org/10.1007/3-540-45869-7_29
State	Published - 2002
Externally published	Yes
Event	5th International Workshop on Document Analysis Systems, DAS 2002 - Princeton, United States
Duration	Aug 19 2002 → Aug 21 2002

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	2423
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	5th International Workshop on Document Analysis Systems, DAS 2002
Country/Territory	United States
City	Princeton
Period	8/19/02 → 8/21/02

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Cite this

Wang, Y., & Hu, J. (2002). Detecting tables in HTML documents. In D. Lopresti, J. Hu, & R. Kashi (Eds.), Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings (pp. 249-260). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2423). Springer Verlag. https://doi.org/10.1007/3-540-45869-7_29

Detecting tables in HTML documents. / Wang, Yalin; Hu, Jianying.
Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. ed. / Daniel Lopresti; Jianying Hu; Ramanujan Kashi. Springer Verlag, 2002. p. 249-260 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2423).

Wang, Y & Hu, J 2002, Detecting tables in HTML documents. in D Lopresti, J Hu & R Kashi (eds), Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2423, Springer Verlag, pp. 249-260, 5th International Workshop on Document Analysis Systems, DAS 2002, Princeton, United States, 8/19/02. https://doi.org/10.1007/3-540-45869-7_29

Wang Y, Hu J. Detecting tables in HTML documents. In Lopresti D, Hu J, Kashi R, editors, Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. Springer Verlag. 2002. p. 249-260. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/3-540-45869-7_29

Wang, Yalin ; Hu, Jianying. / Detecting tables in HTML documents. Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. editor / Daniel Lopresti ; Jianying Hu ; Ramanujan Kashi. Springer Verlag, 2002. pp. 249-260 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{d5699b8293b94fc6b852fc8d9fa4c446,

title = "Detecting tables in HTML documents",

abstract = "Table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although in HTML documents tables are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of the genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to include any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different websites from various domains and contains over 10,000 leaf <table> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.",

author = "Yalin Wang and Jianying Hu",

note = "Publisher Copyright: {\textcopyright} Springer-Verlag Berlin Heidelberg 2002.; 5th International Workshop on Document Analysis Systems, DAS 2002 ; Conference date: 19-08-2002 Through 21-08-2002",

year = "2002",

doi = "10.1007/3-540-45869-7_29",

language = "English (US)",

isbn = "3540440682",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "249--260",

editor = "Daniel Lopresti and Jianying Hu and Ramanujan Kashi",

booktitle = "Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings",

}

TY - GEN

T1 - Detecting tables in HTML documents

AU - Wang, Yalin

AU - Hu, Jianying

N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2002.

PY - 2002

Y1 - 2002

AB - Table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although in HTML documents tables are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of the genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to include any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different websites from various domains and contains over 10,000 leaf <table> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.

UR - http://www.scopus.com/inward/record.url?scp=84947754623&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84947754623&partnerID=8YFLogxK

U2 - 10.1007/3-540-45869-7_29

DO - 10.1007/3-540-45869-7_29

M3 - Conference contribution

AN - SCOPUS:84947754623

SN - 3540440682

SN - 9783540440680

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 249

EP - 260

BT - Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings

A2 - Lopresti, Daniel

A2 - Hu, Jianying

A2 - Kashi, Ramanujan

PB - Springer Verlag

T2 - 5th International Workshop on Document Analysis Systems, DAS 2002

Y2 - 19 August 2002 through 21 August 2002

ER -
