Redundancy-driven web data extraction and integration

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.

Original language: English (US)
Title of host publication: Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
DOI: https://doi.org/10.1145/1859127.1859137
State: Published - 2010
Externally published: Yes
Event: 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 - Indianapolis, IN, United States
Duration: Jun 6 2010 → Jun 6 2010

Other

Other: 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
Country: United States
City: Indianapolis, IN
Period: 6/6/10 → 6/6/10

Fingerprint

World Wide Web
Redundancy
Data integration
Websites
Experiments

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., & Papotti, P. (2010). Redundancy-driven web data extraction and integration. In Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 [7] https://doi.org/10.1145/1859127.1859137

@inproceedings{f3aa2301fa50491eb946727e2f2de0a6,
title = "Redundancy-driven web data extraction and integration",
abstract = "A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.",
author = "Lorenzo Blanco and Mirko Bronzi and Valter Crescenzi and Paolo Merialdo and Paolo Papotti",
year = "2010",
doi = "10.1145/1859127.1859137",
language = "English (US)",
isbn = "9781450301862",
booktitle = "Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010",
}

TY - GEN
T1 - Redundancy-driven web data extraction and integration
AU - Blanco, Lorenzo
AU - Bronzi, Mirko
AU - Crescenzi, Valter
AU - Merialdo, Paolo
AU - Papotti, Paolo
PY - 2010
Y1 - 2010
AB - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.
UR - http://www.scopus.com/inward/record.url?scp=78650432881&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650432881&partnerID=8YFLogxK
U2 - 10.1145/1859127.1859137
DO - 10.1145/1859127.1859137
M3 - Conference contribution
SN - 9781450301862
BT - Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
ER -