TY - GEN
T1 - Probabilistic models to reconcile complex data from inaccurate data sources
AU - Blanco, Lorenzo
AU - Crescenzi, Valter
AU - Merialdo, Paolo
AU - Papotti, Paolo
PY - 2010
Y1 - 2010
AB - Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.
UR - http://www.scopus.com/inward/record.url?scp=79955068748&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79955068748&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-13094-6_8
DO - 10.1007/978-3-642-13094-6_8
M3 - Conference contribution
AN - SCOPUS:79955068748
SN - 3642130933
SN - 9783642130939
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 83
EP - 97
BT - Advanced Information Systems Engineering - 22nd International Conference, CAiSE 2010, Proceedings
T2 - 22nd International Conference on Advanced Information Systems Engineering, CAiSE 2010
Y2 - 7 June 2010 through 9 June 2010
ER -