OntoMiner: automated metadata and instance mining from news websites

Hasan Davulcu, Srinivas Vadrevu, Saravanakumar Nagarajan

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

RDF/XML has been widely recognised as the standard for annotating online web documents and for transforming the HTML web into the so-called Semantic Web. To enable widespread usability of the Semantic Web, there is a need to bootstrap large, rich and up-to-date domain ontologies that organise the most relevant concepts, their relationships and instances. In this paper, we present automated techniques for bootstrapping and populating specialised domain ontologies by organising and mining a set of relevant overlapping websites. We develop algorithms that detect and utilise HTML regularities in the web documents to turn them into hierarchical semantic structures encoded as XML. Next, we present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. Finally, we report an experimental evaluation on the news, travel and shopping domains to demonstrate the efficacy of our algorithms.

Original language: English (US)
Pages (from-to): 196-221
Number of pages: 26
Journal: International Journal of Web and Grid Services
Volume: 1
Issue number: 2
State: Published - 2005

Keywords

  • automation
  • instance ontology
  • metadata
  • mining
  • news
  • semantic
  • web

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
