TY - GEN
T1 - Fixing weakly annotated Web data using relational models
AU - Gelgi, Fatih
AU - Vadrevu, Srinivas
AU - Davulcu, Hasan
PY - 2007
Y1 - 2007
N2 - In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated by a (semi-)automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: it (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites and from 68% to 87% for non-template-driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.
AB - In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated by a (semi-)automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: it (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites and from 68% to 87% for non-template-driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.
KW - Bayesian models
KW - Classification
KW - Information extraction
KW - Weakly annotated data
UR - http://www.scopus.com/inward/record.url?scp=38149003149&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38149003149&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-73597-7_32
DO - 10.1007/978-3-540-73597-7_32
M3 - Conference contribution
AN - SCOPUS:38149003149
SN - 3540735968
SN - 9783540735960
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 385
EP - 399
BT - Web Engineering - 7th International Conference, ICWE 2007, Proceedings
PB - Springer Verlag
T2 - 7th International Conference on Web Engineering, ICWE 2007
Y2 - 16 July 2007 through 20 July 2007
ER -