Exploiting information redundancy to wring out structured data from the web

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceedingConference contribution

6 Scopus citations

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.

Original languageEnglish (US)
Title of host publicationProceedings of the 19th International Conference on World Wide Web, WWW '10
Pages1063-1064
Number of pages2
DOIs
StatePublished - 2010
Externally publishedYes
Event19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States
Duration: Apr 26 2010Apr 30 2010

Publication series

NameProceedings of the 19th International Conference on World Wide Web, WWW '10

Other

Other19th International World Wide Web Conference, WWW2010
Country/TerritoryUnited States
CityRaleigh, NC
Period4/26/104/30/10

Keywords

  • data extraction
  • data integration
  • wrapper generation

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Exploiting information redundancy to wring out structured data from the web'. Together they form a unique fingerprint.

Cite this