Index-based R-S similarity joins

Spencer S. Pearson, Yasin Silva

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Scopus citations

Abstract

Similarity Joins are some of the most useful and powerful data processing operations. They retrieve all the pairs of data points between different data sets that are considered similar within a certain threshold. This operation is useful in many situations, such as record linkage, data cleaning, and many other applications. An important method to implement efficient Similarity Joins is the use of indexing structures. The previous work, however, only supports self joins or requires the joint indexing of every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximizes performance. Our results show that our algorithm significantly outperforms the alternative one in terms of distance computations, and reveal interesting properties when comparing execution time.
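
As an illustration of the operation the abstract defines, the following is a minimal sketch of an R-S similarity join written as a plain nested loop. It is not the eD-Index extension the paper presents; it is only the unindexed baseline that an index-based method tries to improve on by reducing distance computations. The function name, the Euclidean distance default, and the sample points are assumptions made for this example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def naive_rs_similarity_join(r, s, eps, distance=dist):
    """Return all (a, b) pairs with a in r, b in s, and distance(a, b) <= eps.

    Hypothetical baseline for illustration only: it compares every pair,
    so it always performs len(r) * len(s) distance computations.
    """
    pairs = []
    computations = 0
    for a in r:
        for b in s:
            computations += 1
            if distance(a, b) <= eps:
                pairs.append((a, b))
    return pairs, computations


# Two separate relations R and S (made-up sample points).
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]

pairs, computations = naive_rs_similarity_join(R, S, eps=1.0)
print(pairs)
print(computations)
```

The counter makes the cost visible: the nested loop evaluates the distance function for every pair of points, which is the quantity the abstract reports as reduced by the index-based approach.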

Original language: English (US)
Title of host publication: Similarity Search and Applications - 7th International Conference, SISAP 2014, Proceedings
Editors: Agma Juci Machado Traina, Caetano Traina, Robson Leonardo Ferreira Cordeiro
Publisher: Springer Verlag
Pages: 106-112
Number of pages: 7
ISBN (Electronic): 9783319119878
State: Published - Jan 1 2014
Event: 7th International Conference on Similarity Search and Applications, SISAP 2014 - Los Cabos, Mexico
Duration: Oct 29 2014 to Oct 31 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8821
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Conference on Similarity Search and Applications, SISAP 2014
Country/Territory: Mexico
City: Los Cabos
Period: 10/29/14 to 10/31/14

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
