Automated metadata and instance extraction from news Web sites

Srinivas Vadrevu, Saravanakumar Nagarajan, Fatih Gelgi, Hasan Davulcu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Citations (Scopus)

Abstract

Over the past few years, the World Wide Web has established itself as a vital resource for news. With the continuous growth in the number of available news Web sites and the diversity in their presentation of content, there is an increasing need to organize and keep track of news-related information on the Web. In this paper, we present automated techniques for extracting metadata and instance information by organizing and mining a set of news Web sites. We develop algorithms that detect and utilize HTML regularities in Web documents to turn them into hierarchical semantic structures encoded as XML. The tree-mining algorithms that we present identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances, annotated with their labels whenever labels are available. We report an experimental evaluation on the news domain to demonstrate the efficacy of our algorithms.
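The idea of turning HTML regularities into a hierarchical XML structure can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it assumes, purely for illustration, that heading tags (h1 to h3) mark concepts and that list items under a heading are its instances. The names OutlineBuilder and html_to_xml and the element names site, concept and instance are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption for this sketch: heading depth stands in for taxonomy depth.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3}


class OutlineBuilder(HTMLParser):
    """Builds a nested XML outline from heading and list-item regularities."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("site")
        # Stack of (heading level, XML element); the root sits at level 0.
        self.stack = [(0, self.root)]
        self.current = None  # tag whose text is currently being collected
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "li":
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return  # closing tag of a nested element such as <a>
        text = " ".join("".join(self.buffer).split())
        self.current = None
        if not text:
            return
        if tag in HEADINGS:
            level = HEADINGS[tag]
            # Pop until the top of the stack is a shallower heading.
            while self.stack[-1][0] >= level:
                self.stack.pop()
            node = ET.SubElement(self.stack[-1][1], "concept", label=text)
            self.stack.append((level, node))
        else:
            # A list item becomes an instance of the nearest concept.
            ET.SubElement(self.stack[-1][1], "instance").text = text


def html_to_xml(html: str) -> str:
    """Returns the outline of an HTML fragment serialized as XML."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

Calling html_to_xml on a page fragment with an h1 "News", an h2 "Sports" and a list of headlines yields a site element containing a "News" concept, a nested "Sports" concept, and one instance element per headline. A real system would need to infer such regularities from the pages rather than hard-code the tag roles as done here.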

Original language: English (US)
Title of host publication: Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005
Pages: 38-41
Number of pages: 4
Volume: 2005
DOIs: https://doi.org/10.1109/WI.2005.38
State: Published - 2005
Event: 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005 - Compiegne Cedex, France
Duration: Sep 19 2005 to Sep 22 2005

Other

Other: 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005
Country: France
City: Compiegne Cedex
Period: 9/19/05 to 9/22/05

Fingerprint

Metadata
Websites
World Wide Web
HTML
XML
Labels
Semantics

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Vadrevu, S., Nagarajan, S., Gelgi, F., & Davulcu, H. (2005). Automated metadata and instance extraction from news Web sites. In Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005 (Vol. 2005, pp. 38-41). [1517813] https://doi.org/10.1109/WI.2005.38

@inproceedings{a81eca385f4643018e96d04ad54c2425,
title = "Automated metadata and instance extraction from news Web sites",
author = "Srinivas Vadrevu and Saravanakumar Nagarajan and Fatih Gelgi and Hasan Davulcu",
year = "2005",
doi = "10.1109/WI.2005.38",
language = "English (US)",
isbn = "076952415X",
volume = "2005",
pages = "38--41",
booktitle = "Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005",

}

TY - GEN

T1 - Automated metadata and instance extraction from news Web sites

AU - Vadrevu, Srinivas

AU - Nagarajan, Saravanakumar

AU - Gelgi, Fatih

AU - Davulcu, Hasan

PY - 2005

Y1 - 2005

UR - http://www.scopus.com/inward/record.url?scp=33748849300&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33748849300&partnerID=8YFLogxK

U2 - 10.1109/WI.2005.38

DO - 10.1109/WI.2005.38

M3 - Conference contribution

AN - SCOPUS:33748849300

SN - 076952415X

SN - 9780769524153

VL - 2005

SP - 38

EP - 41

BT - Proceedings - 2005 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2005

ER -