Index-based R-S similarity joins

Spencer S. Pearson, Yasin Silva

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

9 Citations (Scopus)

Abstract

Similarity Joins are some of the most useful and powerful data processing operations. They retrieve all the pairs of data points between different data sets that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage, data cleaning, and many other applications. An important method to implement efficient Similarity Joins is the use of indexing structures. The previous work, however, only supports self joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximize performance. Our results show that our algorithm significantly outperforms the alternative one in terms of distance computations, and reveal interesting properties when comparing execution time.
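
To make the operation described in the abstract concrete, the following is a minimal brute-force sketch of an R-S similarity join. It is not the eD-Index-based algorithm of the paper, only the naive baseline that an index is meant to improve on. The function name, the Euclidean distance, the inclusive threshold (distance <= eps), and the sample data are all assumptions made for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance, available since Python 3.8


def naive_rs_similarity_join(r, s, eps, distance=dist):
    """Return every pair (x, y), x from r and y from s, with distance(x, y) <= eps.

    Brute force: computes len(r) * len(s) distances. The inclusive
    threshold and the default Euclidean metric are assumptions.
    """
    return [(x, y) for x, y in product(r, s) if distance(x, y) <= eps]


# Hypothetical sample relations of 2-D points.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 4.5), (9.0, 9.0)]

pairs = naive_rs_similarity_join(R, S, eps=1.0)
print(pairs)
```

Every pair of points is compared here, so the number of distance computations grows with the product of the two relation sizes. Distance computations are the cost measure the abstract reports, and reducing them is the purpose of an index-based approach.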

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 106-112
Number of pages: 7
Volume: 8821
ISBN (Print): 9783319119878
DOIs: https://doi.org/10.1007/978-3-319-11988-5_10
State: Published - 2014
Event: 7th International Conference on Similarity Search and Applications, SISAP 2014 - Los Cabos, Mexico
Duration: Oct 29 2014 to Oct 31 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8821
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Conference on Similarity Search and Applications, SISAP 2014
Country: Mexico
City: Los Cabos
Period: 10/29/14 to 10/31/14

Fingerprint

Join
Indexing
Record Linkage
Alternatives
Cleaning
Execution Time
Maximise
Similarity
Configuration
Evaluate

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Pearson, S. S., & Silva, Y. (2014). Index-based R-S similarity joins. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8821, pp. 106-112). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8821). Springer Verlag. https://doi.org/10.1007/978-3-319-11988-5_10

@inproceedings{592a2e67b6584e908c21489294a74d6e,
title = "Index-based R-S similarity joins",
abstract = "Similarity Joins are some of the most useful and powerful data processing operations. They retrieve all the pairs of data points between different data sets that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage, data cleaning, and many other applications. An important method to implement efficient Similarity Joins is the use of indexing structures. The previous work, however, only supports self joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximize performance. Our results show that our algorithm significantly outperforms the alternative one in terms of distance computations, and reveal interesting properties when comparing execution time.",
author = "Pearson, {Spencer S.} and Yasin Silva",
year = "2014",
doi = "10.1007/978-3-319-11988-5_10",
language = "English (US)",
isbn = "9783319119878",
volume = "8821",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "106--112",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Index-based R-S similarity joins

AU - Pearson, Spencer S.

AU - Silva, Yasin

PY - 2014

Y1 - 2014

AB - Similarity Joins are some of the most useful and powerful data processing operations. They retrieve all the pairs of data points between different data sets that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage, data cleaning, and many other applications. An important method to implement efficient Similarity Joins is the use of indexing structures. The previous work, however, only supports self joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximize performance. Our results show that our algorithm significantly outperforms the alternative one in terms of distance computations, and reveal interesting properties when comparing execution time.

UR - http://www.scopus.com/inward/record.url?scp=84911138796&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84911138796&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-11988-5_10

DO - 10.1007/978-3-319-11988-5_10

M3 - Conference contribution

AN - SCOPUS:84911138796

SN - 9783319119878

VL - 8821

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 106

EP - 112

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

PB - Springer Verlag

ER -