DataRover: A taxonomy based crawler for automated data extraction from data-intensive websites

Hasan Davulcu, S. Koduri, S. Nagarajan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

The advent of e-commerce has brought thousands of catalogs online. Most of these websites are "taxonomy-directed": a website is taxonomy-directed if it contains at least one taxonomy for organizing its contents and presents the instances belonging to a category in a regular fashion. This paper describes the DataRover system, which automatically crawls taxonomy-directed online catalogs and extracts their products. DataRover uses heuristic rules to discover structural regularities among taxonomy segments, list-of-product pages, and single-product pages, and it uses these regularities to turn online catalogs into a database of categorized products without user interaction or the burden of wrapper maintenance. We provide experimental results to demonstrate the efficacy of DataRover and point out its current limitations.

Original language: English (US)
Title of host publication: Proceedings of the International Workshop on Web Information and Data Management
Editors: R. Chiang, A.H.F. Laender, E.-P. Lim
Pages: 9-14
Number of pages: 6
State: Published - 2003
Event: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management - New Orleans, LA, United States
Duration: Nov 7 2003 → Nov 8 2003

Other

Other: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management
Country: United States
City: New Orleans, LA
Period: 11/7/03 → 11/8/03

Keywords

  • Web Annotation
  • Web Data Extraction
  • Web Data Integration

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems

