Exploiting information redundancy to wring out structured data from the web

Lorenzo Blanco; Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti

doi:10.1145/1772690.1772805

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data and product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.

Original language	English (US)
Title of host publication	Proceedings of the 19th International Conference on World Wide Web, WWW '10
Pages	1063-1064
Number of pages	2
DOIs	https://doi.org/10.1145/1772690.1772805
State	Published - 2010
Externally published	Yes
Event	19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States; Duration: Apr 26 2010 → Apr 30 2010

Publication series

Name	Proceedings of the 19th International Conference on World Wide Web, WWW '10

Other

Other	19th International World Wide Web Conference, WWW2010
Country/Territory	United States
City	Raleigh, NC
Period	4/26/10 → 4/30/10

Keywords

data extraction
data integration
wrapper generation

ASJC Scopus subject areas

Computer Networks and Communications
Computer Science Applications

Access to Document

10.1145/1772690.1772805

Cite this

Exploiting information redundancy to wring out structured data from the web. / Blanco, Lorenzo; Bronzi, Mirko; Crescenzi, Valter et al.
Proceedings of the 19th International Conference on World Wide Web, WWW '10. 2010. p. 1063-1064 (Proceedings of the 19th International Conference on World Wide Web, WWW '10).

Blanco, L, Bronzi, M, Crescenzi, V, Merialdo, P & Papotti, P 2010, Exploiting information redundancy to wring out structured data from the web. in Proceedings of the 19th International Conference on World Wide Web, WWW '10. Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 1063-1064, 19th International World Wide Web Conference, WWW2010, Raleigh, NC, United States, 4/26/10. https://doi.org/10.1145/1772690.1772805

@inproceedings{a317292952824fd8a3836fb3a5bb9084,
title = "Exploiting information redundancy to wring out structured data from the web",
abstract = "A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.",
keywords = "data extraction, data integration, wrapper generation",
author = "Lorenzo Blanco and Mirko Bronzi and Valter Crescenzi and Paolo Merialdo and Paolo Papotti",
year = "2010",
doi = "10.1145/1772690.1772805",
language = "English (US)",
isbn = "9781605587998",
series = "Proceedings of the 19th International Conference on World Wide Web, WWW '10",
pages = "1063--1064",
booktitle = "Proceedings of the 19th International Conference on World Wide Web, WWW '10",
note = "19th International World Wide Web Conference, WWW2010 ; Conference date: 26-04-2010 Through 30-04-2010",
}

TY  - GEN
T1  - Exploiting information redundancy to wring out structured data from the web
AU  - Blanco, Lorenzo
AU  - Bronzi, Mirko
AU  - Crescenzi, Valter
AU  - Merialdo, Paolo
AU  - Papotti, Paolo
PY  - 2010
Y1  - 2010
N2  - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.
AB  - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.
KW  - data extraction
KW  - data integration
KW  - wrapper generation
UR  - http://www.scopus.com/inward/record.url?scp=77954591954&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=77954591954&partnerID=8YFLogxK
U2  - 10.1145/1772690.1772805
DO  - 10.1145/1772690.1772805
M3  - Conference contribution
AN  - SCOPUS:77954591954
SN  - 9781605587998
T3  - Proceedings of the 19th International Conference on World Wide Web, WWW '10
SP  - 1063
EP  - 1064
BT  - Proceedings of the 19th International Conference on World Wide Web, WWW '10
T2  - 19th International World Wide Web Conference, WWW2010
Y2  - 26 April 2010 through 30 April 2010
ER  -
