TY - GEN
T1 - Redundancy-driven web data extraction and integration
AU - Blanco, Lorenzo
AU - Bronzi, Mirko
AU - Crescenzi, Valter
AU - Merialdo, Paolo
AU - Papotti, Paolo
PY - 2010
Y1 - 2010
N2 - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.
AB - A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.
UR - http://www.scopus.com/inward/record.url?scp=78650432881&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650432881&partnerID=8YFLogxK
U2 - 10.1145/1859127.1859137
DO - 10.1145/1859127.1859137
M3 - Conference contribution
AN - SCOPUS:78650432881
SN - 9781450301862
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
BT - Proceedings of the 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
PB - Association for Computing Machinery
T2 - 13th International Workshop on the Web and Databases, WebDB 2010, Co-located with ACM SIGMOD 2010
Y2 - 6 June 2010 through 6 June 2010
ER -