TY - GEN
T1 - Detecting tables in HTML documents
AU - Wang, Yalin
AU - Hu, Jianying
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2002.
PY - 2002
Y1 - 2002
N2 - Table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although in HTML documents tables are generally marked as <TABLE> elements, a <TABLE> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of the genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to include any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different websites from various domains and contains over 10,000 leaf <TABLE> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
AB - Table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although in HTML documents tables are generally marked as <TABLE> elements, a <TABLE> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of the genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to include any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different websites from various domains and contains over 10,000 leaf <TABLE> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
UR - http://www.scopus.com/inward/record.url?scp=84947754623&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84947754623&partnerID=8YFLogxK
U2 - 10.1007/3-540-45869-7_29
DO - 10.1007/3-540-45869-7_29
M3 - Conference contribution
AN - SCOPUS:84947754623
SN - 3540440682
SN - 9783540440680
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 249
EP - 260
BT - Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings
A2 - Lopresti, Daniel
A2 - Hu, Jianying
A2 - Kashi, Ramanujan
PB - Springer Verlag
T2 - 5th International Workshop on Document Analysis Systems, DAS 2002
Y2 - 19 August 2002 through 21 August 2002
ER -