Index-based R-S similarity joins

Spencer S. Pearson; Yasin Silva

doi:10.1007/978-3-319-11988-5_10

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Scopus citations

Abstract

Similarity Joins are among the most useful and powerful data processing operations. They retrieve all pairs of data points from different data sets that are similar within a given threshold. This operation is useful in many applications, such as record linkage and data cleaning. An important way to implement Similarity Joins efficiently is to use indexing structures. Previous work, however, either supports only self-joins or requires jointly indexing every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the parameter configurations that maximize performance. Our results show that our algorithm significantly outperforms the alternative in terms of distance computations, and they reveal interesting properties when comparing execution time.
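To make the operation concrete, the following is a minimal sketch of an R-S similarity join computed by brute force. It is not the eD-Index extension described in the abstract; it only illustrates the definition (all cross-relation pairs within a threshold) and the distance-computation count that index-based methods aim to reduce. The function name `naive_rs_similarity_join`, the parameter `eps`, the Euclidean metric, and the sample points are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def naive_rs_similarity_join(r, s, eps):
    """Brute-force R-S similarity join (illustrative baseline only).

    Returns every pair (a, b) with a in r, b in s, and d(a, b) <= eps,
    together with the number of distance computations performed.
    """
    pairs = []
    computations = 0
    for a in r:
        for b in s:
            computations += 1  # one distance evaluation per candidate pair
            if dist(a, b) <= eps:
                pairs.append((a, b))
    return pairs, computations


# Hypothetical sample relations of 2-D points.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]

pairs, n = naive_rs_similarity_join(R, S, eps=1.0)
print(pairs)  # pairs whose distance is at most eps
print(n)      # |R| * |S| distance computations for the brute-force join
```

The brute-force join always evaluates |R| × |S| distances, which is why the number of distance computations is a natural cost measure when comparing index-based alternatives.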

Original language	English (US)
Title of host publication	Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings
Editors	Agma Juci Machado Traina, Caetano Traina, Robson Leonardo Ferreira Cordeiro
Publisher	Springer Verlag
Pages	106-112
Number of pages	7
ISBN (Electronic)	9783319119878
DOIs	https://doi.org/10.1007/978-3-319-11988-5_10
State	Published - Jan 1 2014
Event	7th International Conference on Similarity Search and Applications, SISAP 2014 - Los Cabos, Mexico
Duration	Oct 29 2014 → Oct 31 2014

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	8821
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	7th International Conference on Similarity Search and Applications, SISAP 2014
Country/Territory	Mexico
City	Los Cabos
Period	10/29/14 → 10/31/14

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-319-11988-5_10

Cite this

Pearson, S. S., & Silva, Y. (2014). Index-based R-S similarity joins. In A. J. M. Traina, C. Traina, & R. L. F. Cordeiro (Eds.), Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings (pp. 106-112). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8821). Springer Verlag. https://doi.org/10.1007/978-3-319-11988-5_10

Index-based R-S similarity joins. / Pearson, Spencer S.; Silva, Yasin.
Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings. ed. / Agma Juci Machado Traina; Caetano Traina; Robson Leonardo Ferreira Cordeiro. Springer Verlag, 2014. p. 106-112 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8821).

Pearson, SS & Silva, Y 2014, Index-based R-S similarity joins. in AJM Traina, C Traina & RLF Cordeiro (eds), Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8821, Springer Verlag, pp. 106-112, 7th International Conference on Similarity Search and Applications, SISAP 2014, Los Cabos, Mexico, 10/29/14. https://doi.org/10.1007/978-3-319-11988-5_10

Pearson SS, Silva Y. Index-based R-S similarity joins. In Traina AJM, Traina C, Cordeiro RLF, editors, Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings. Springer Verlag. 2014. p. 106-112. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-11988-5_10

Pearson, Spencer S. ; Silva, Yasin. / Index-based R-S similarity joins. Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings. editor / Agma Juci Machado Traina ; Caetano Traina ; Robson Leonardo Ferreira Cordeiro. Springer Verlag, 2014. pp. 106-112 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{592a2e67b6584e908c21489294a74d6e,
  title = "Index-based R-S similarity joins",
  abstract = "Similarity Joins are among the most useful and powerful data processing operations. They retrieve all pairs of data points from different data sets that are similar within a given threshold. This operation is useful in many applications, such as record linkage and data cleaning. An important way to implement Similarity Joins efficiently is to use indexing structures. Previous work, however, either supports only self-joins or requires jointly indexing every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the parameter configurations that maximize performance. Our results show that our algorithm significantly outperforms the alternative in terms of distance computations, and they reveal interesting properties when comparing execution time.",
  author = "Pearson, {Spencer S.} and Yasin Silva",
  year = "2014",
  month = jan,
  day = "1",
  doi = "10.1007/978-3-319-11988-5_10",
  language = "English (US)",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer Verlag",
  pages = "106--112",
  editor = "Traina, {Agma Juci Machado} and Caetano Traina and Cordeiro, {Robson Leonardo Ferreira}",
  booktitle = "Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings",
  note = "7th International Conference on Similarity Search and Applications, SISAP 2014 ; Conference date: 29-10-2014 Through 31-10-2014",
}

TY  - GEN
T1  - Index-based R-S similarity joins
AU  - Pearson, Spencer S.
AU  - Silva, Yasin
PY  - 2014/1/1
Y1  - 2014/1/1
N2  - Similarity Joins are among the most useful and powerful data processing operations. They retrieve all pairs of data points from different data sets that are similar within a given threshold. This operation is useful in many applications, such as record linkage and data cleaning. An important way to implement Similarity Joins efficiently is to use indexing structures. Previous work, however, either supports only self-joins or requires jointly indexing every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the parameter configurations that maximize performance. Our results show that our algorithm significantly outperforms the alternative in terms of distance computations, and they reveal interesting properties when comparing execution time.
UR  - http://www.scopus.com/inward/record.url?scp=84911138796&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84911138796&partnerID=8YFLogxK
U2  - 10.1007/978-3-319-11988-5_10
DO  - 10.1007/978-3-319-11988-5_10
M3  - Conference contribution
AN  - SCOPUS:84911138796
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 106
EP  - 112
BT  - Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings
A2  - Traina, Agma Juci Machado
A2  - Traina, Caetano
A2  - Cordeiro, Robson Leonardo Ferreira
PB  - Springer Verlag
T2  - 7th International Conference on Similarity Search and Applications, SISAP 2014
Y2  - 29 October 2014 through 31 October 2014
ER  - 
