TY - GEN
T1 - A machine learning based approach for table detection on the web
AU - Wang, Yalin
AU - Hu, Jianying
PY - 2002
Y1 - 2002
N2 - Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
AB - Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
KW - Decision tree
KW - Information retrieval
KW - Layout analysis
KW - Machine learning
KW - Support vector machine
KW - Table detection
UR - http://www.scopus.com/inward/record.url?scp=33845353207&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33845353207&partnerID=8YFLogxK
U2 - 10.1145/511446.511478
DO - 10.1145/511446.511478
M3 - Conference contribution
AN - SCOPUS:33845353207
SN - 1581134495
SN - 9781581134490
T3 - Proceedings of the 11th International Conference on World Wide Web, WWW '02
SP - 242
EP - 250
BT - Proceedings of the 11th International Conference on World Wide Web, WWW '02
T2 - 11th International Conference on World Wide Web, WWW '02
Y2 - 7 May 2002 through 11 May 2002
ER -