The similarity join database operator

Yasin N. Silva, Walid G. Aref, Mohamed H. Ali

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

53 Scopus citations

Abstract

Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for in-memory and external-memory data to techniques that use standard database operators to answer similarity joins. Unfortunately, the role and implementation of similarity joins as physical database operators have received little study. In this paper, we study similarity joins as first-class database operators. We define several similarity join operators and study how they interact with one another, with standard database operators, and with previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of that of the equivalent query that uses only regular operators, while ε-Join's execution time is 20% to 90% of that of its equivalent query based on regular operators for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of those of the initial query plans.

Original language: English (US)
Title of host publication: 26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings
Pages: 892-903
Number of pages: 12
State: Published - 2010
Externally published: Yes
Event: 26th IEEE International Conference on Data Engineering, ICDE 2010 - Long Beach, CA, United States
Duration: Mar 1 2010 to Mar 6 2010

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627

Other

Other: 26th IEEE International Conference on Data Engineering, ICDE 2010
Country/Territory: United States
City: Long Beach, CA
Period: 3/1/10 to 3/6/10

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems
