Redundancy-driven web data extraction and integration

Lorenzo Blanco; Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti

doi:10.1145/1859127.1859137

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.

Original language	English (US)
Title of host publication	Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
Publisher	Association for Computing Machinery
ISBN (Print)	9781450301862
DOIs	https://doi.org/10.1145/1859127.1859137
State	Published - 2010
Externally published	Yes
Event	13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 - Indianapolis, IN, United States Duration: Jun 6 2010 → Jun 6 2010

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Other

Other	13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
Country/Territory	United States
City	Indianapolis, IN
Period	6/6/10 → 6/6/10

ASJC Scopus subject areas

Software
Information Systems

Access to Document

10.1145/1859127.1859137

Cite this

Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., & Papotti, P. (2010). Redundancy-driven web data extraction and integration. In Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 Article 7 (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/1859127.1859137

Redundancy-driven web data extraction and integration. / Blanco, Lorenzo; Bronzi, Mirko; Crescenzi, Valter et al.
Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010. Association for Computing Machinery, 2010. 7 (Proceedings of the ACM SIGMOD International Conference on Management of Data).

Blanco, L, Bronzi, M, Crescenzi, V, Merialdo, P & Papotti, P 2010, Redundancy-driven web data extraction and integration. in Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010., 7, Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010, Indianapolis, IN, United States, 6/6/10. https://doi.org/10.1145/1859127.1859137

Blanco L, Bronzi M, Crescenzi V, Merialdo P, Papotti P. Redundancy-driven web data extraction and integration. In Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010. Association for Computing Machinery. 2010. 7. (Proceedings of the ACM SIGMOD International Conference on Management of Data). doi: 10.1145/1859127.1859137

@inproceedings{f3aa2301fa50491eb946727e2f2de0a6,

title = "Redundancy-driven web data extraction and integration",

abstract = "A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.",

author = "Lorenzo Blanco and Mirko Bronzi and Valter Crescenzi and Paolo Merialdo and Paolo Papotti",

year = "2010",

doi = "10.1145/1859127.1859137",

language = "English (US)",

isbn = "9781450301862",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

booktitle = "Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010",

note = "13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010 ; Conference date: 06-06-2010 Through 06-06-2010",

}

TY - GEN

T1 - Redundancy-driven web data extraction and integration

AU - Blanco, Lorenzo

AU - Bronzi, Mirko

AU - Crescenzi, Valter

AU - Merialdo, Paolo

AU - Papotti, Paolo

PY - 2010

Y1 - 2010

N2 - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.

AB - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.

UR - http://www.scopus.com/inward/record.url?scp=78650432881&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650432881&partnerID=8YFLogxK

U2 - 10.1145/1859127.1859137

DO - 10.1145/1859127.1859137

M3 - Conference contribution

AN - SCOPUS:78650432881

SN - 9781450301862

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

BT - Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010

PB - Association for Computing Machinery

T2 - 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010

Y2 - 6 June 2010 through 6 June 2010

ER -
