In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated by a (semi-)automated information extraction (IE) system from Web documents. Weakly annotated data suffers from incorrect ontological role assignments. Our experimental evaluations with the TAP and a collection of 20,000 home pages from university, shopping, and sports Web sites indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites and from 68% to 87% for non-template-driven sites.
Keywords
- Bayesian models
- Information extraction
- Weakly annotated data
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Computational Mechanics
- Computer Science Applications