TY - GEN
T1 - Index-based R-S similarity joins
AU - Pearson, Spencer S.
AU - Silva, Yasin
PY - 2014/1/1
Y1 - 2014/1/1
N2 - Similarity Joins are some of the most useful and powerful data processing operations. They retrieve all the pairs of data points between different data sets that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage, data cleaning, and many other applications. An important method to implement efficient Similarity Joins is the use of indexing structures. The previous work, however, only supports self joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximize performance. Our results show that our algorithm significantly outperforms the alternative one in terms of distance computations, and reveal interesting properties when comparing execution time.
AB - Similarity Joins are some of the most useful and powerful data processing operations. They retrieve all the pairs of data points between different data sets that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage, data cleaning, and many other applications. An important method to implement efficient Similarity Joins is the use of indexing structures. The previous work, however, only supports self joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximize performance. Our results show that our algorithm significantly outperforms the alternative one in terms of distance computations, and reveal interesting properties when comparing execution time.
UR - http://www.scopus.com/inward/record.url?scp=84911138796&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84911138796&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-11988-5_10
DO - 10.1007/978-3-319-11988-5_10
M3 - Conference contribution
AN - SCOPUS:84911138796
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 106
EP - 112
BT - Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings
A2 - Traina, Agma Juci Machado
A2 - Traina, Caetano
A2 - Cordeiro, Robson Leonardo Ferreira
PB - Springer Verlag
T2 - 7th International Conference on Similarity Search and Applications, SISAP 2014
Y2 - 29 October 2014 through 31 October 2014
ER -