TY - GEN
T1 - The similarity join database operator
AU - Silva, Yasin N.
AU - Aref, Walid G.
AU - Ali, Mohamed H.
PY - 2010
Y1 - 2010
N2 - Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while ε-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.
AB - Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while ε-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.
UR - http://www.scopus.com/inward/record.url?scp=77952772124&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952772124&partnerID=8YFLogxK
U2 - 10.1109/ICDE.201.5447873
DO - 10.1109/ICDE.201.5447873
M3 - Conference contribution
AN - SCOPUS:77952772124
SN - 9781424454440
T3 - Proceedings - International Conference on Data Engineering
SP - 892
EP - 903
BT - 26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings
T2 - 26th IEEE International Conference on Data Engineering, ICDE 2010
Y2 - 1 March 2010 through 6 March 2010
ER -