Contextual data extraction and instance-based integration

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Research output: Contribution to journal › Conference article › peer-review

Abstract

We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, which consists of generating the extraction rules needed to obtain data from web pages, and the data integration problem, which consists of integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional "waterfall approach", in which data are wholly extracted before the integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data that are exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model to real-world websites.

Original language: English (US)
Journal: CEUR Workshop Proceedings
Volume: 880
State: Published - Dec 1 2011
Event: 1st International Workshop on Searching and Integrating New Web Data Sources - Very Large Data Search, VLDS 2011 - Seattle, WA, United States
Duration: Sep 2 2011 → Sep 2 2011

ASJC Scopus subject areas

  • Computer Science (all)
