The similarity join database operator

Yasin N. Silva; Walid G. Aref; Mohamed H. Ali

doi:10.1109/ICDE.2010.5447873

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

53 Scopus citations

Abstract

Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while ε-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.

Original language	English (US)
Title of host publication	26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings
Pages	892-903
Number of pages	12
DOIs	https://doi.org/10.1109/ICDE.2010.5447873
State	Published - 2010
Externally published	Yes
Event	26th IEEE International Conference on Data Engineering, ICDE 2010 - Long Beach, CA, United States; Duration: Mar 1 2010 → Mar 6 2010

Publication series

Name	Proceedings - International Conference on Data Engineering
ISSN (Print)	1084-4627

Other

Other	26th IEEE International Conference on Data Engineering, ICDE 2010
Country/Territory	United States
City	Long Beach, CA
Period	3/1/10 → 3/6/10

ASJC Scopus subject areas

Software
Signal Processing
Information Systems

Access to Document

10.1109/ICDE.2010.5447873

Cite this

Silva, YN, Aref, WG & Ali, MH 2010, The similarity join database operator. in 26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings, 5447873, Proceedings - International Conference on Data Engineering, pp. 892-903, 26th IEEE International Conference on Data Engineering, ICDE 2010, Long Beach, CA, United States, 3/1/10. https://doi.org/10.1109/ICDE.2010.5447873

@inproceedings{26eebecd948c457ab2bd9929c1a4a564,

title = "The similarity join database operator",

abstract = "Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while ε-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.",

author = "Silva, {Yasin N.} and Aref, {Walid G.} and Ali, {Mohamed H.}",

year = "2010",

doi = "10.1109/ICDE.2010.5447873",

language = "English (US)",

isbn = "9781424454440",

series = "Proceedings - International Conference on Data Engineering",

pages = "892--903",

booktitle = "26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings",

note = "26th IEEE International Conference on Data Engineering, ICDE 2010 ; Conference date: 01-03-2010 Through 06-03-2010",

}

TY - GEN

T1 - The similarity join database operator

AU - Silva, Yasin N.

AU - Aref, Walid G.

AU - Ali, Mohamed H.

PY - 2010

Y1 - 2010

N2 - Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while ε-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.

UR - http://www.scopus.com/inward/record.url?scp=77952772124&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952772124&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2010.5447873

DO - 10.1109/ICDE.2010.5447873

M3 - Conference contribution

AN - SCOPUS:77952772124

SN - 9781424454440

T3 - Proceedings - International Conference on Data Engineering

SP - 892

EP - 903

BT - 26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings

T2 - 26th IEEE International Conference on Data Engineering, ICDE 2010

Y2 - 1 March 2010 through 6 March 2010

ER -
