Redundancy-driven web data extraction and integration

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Scopus citations

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.

Original language: English (US)
Title of host publication: Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
Publisher: Association for Computing Machinery
ISBN (Print): 9781450301862
State: Published - 2010
Externally published: Yes
Event: 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 - Indianapolis, IN, United States
Duration: Jun 6 2010 → Jun 6 2010

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Other

Other: 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
Country/Territory: United States
City: Indianapolis, IN
Period: 6/6/10 → 6/6/10

ASJC Scopus subject areas

  • Software
  • Information Systems
