Abstract
Over the past few years, the World Wide Web has established itself as a vital resource for news. With the continuous growth in the number of available news Web sites and the diversity in their presentation of content, there is an increasing need to organize the news-related information on the Web and keep track of it. In this paper, we present automated techniques for extracting metadata and instance information by organizing and mining a set of news Web sites. We develop algorithms that detect and utilize HTML regularities in the Web documents to turn them into hierarchical semantic structures encoded as XML. The tree-mining algorithms that we present identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. We report an experimental evaluation for the news domain to demonstrate the efficacy of our algorithms.
Original language | English (US) |
---|---|
Title of host publication | Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005 |
Pages | 38-41 |
Number of pages | 4 |
Volume | 2005 |
DOIs | 10.1109/WI.2005.38 |
State | Published - 2005 |
Event | 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005 - Compiegne Cedex, France Duration: Sep 19 2005 → Sep 22 2005 |
Other
Other | 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005 |
---|---|
Country | France |
City | Compiegne Cedex |
Period | 9/19/05 → 9/22/05 |
ASJC Scopus subject areas
- Engineering(all)
Cite this
Automated metadata and instance extraction from news Web sites. / Vadrevu, Srinivas; Nagarajan, Saravanakumar; Gelgi, Fatih; Davulcu, Hasan.
Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005. Vol. 2005 2005. p. 38-41 1517813.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Automated metadata and instance extraction from news Web sites
AU - Vadrevu, Srinivas
AU - Nagarajan, Saravanakumar
AU - Gelgi, Fatih
AU - Davulcu, Hasan
PY - 2005
Y1 - 2005
UR - http://www.scopus.com/inward/record.url?scp=33748849300&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33748849300&partnerID=8YFLogxK
U2 - 10.1109/WI.2005.38
DO - 10.1109/WI.2005.38
M3 - Conference contribution
AN - SCOPUS:33748849300
SN - 076952415X
SN - 9780769524153
VL - 2005
SP - 38
EP - 41
BT - Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005
ER -