Similarity joins

Their implementation and interactions with other database operators

Yasin Silva, Spencer S. Pearson, Jaime Chon, Ryan Roberts

Research output: Contribution to journal (Article)

11 Citations (Scopus)

Abstract

Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Joins as physical database operators. In this paper, we focus on the study, design, implementation, and optimization of a Similarity Join database operator for metric spaces. We present DBSimJoin, a physical database operator that integrates techniques to enable non-blocking behavior, prioritize the early generation of results, and fully support the database iterator interface. The proposed operator can be used with multiple distance functions and data types. We describe the changes in each query engine module needed to implement DBSimJoin and provide details of our implementation in PostgreSQL. We also study ways in which DBSimJoin can be combined with other similarity and non-similarity operators to answer more complex queries, and how DBSimJoin can be used in query transformation rules to improve query performance. The extensive performance evaluation shows that DBSimJoin significantly outperforms alternative approaches and scales very well when important parameters like ε, data size, and number of dimensions increase.
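To make the definition concrete, the following is a minimal sketch of an ε-join written as a Python generator. It is not DBSimJoin's algorithm, which the abstract does not detail. It is a naive nested-loop join that only illustrates the semantics (pairs whose distance is strictly smaller than ε) and the iterator-style, non-blocking delivery of results. The names `similarity_join` and `euclidean`, and the sample points, are assumptions made for illustration.

```python
import math
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric distance function could be swapped in."""
    return math.dist(a, b)


def similarity_join(
    left: Iterable[Point],
    right: Sequence[Point],
    eps: float,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Iterator[Tuple[Point, Point]]:
    """Yield every pair (a, b) with dist(a, b) < eps.

    Hypothetical naive nested-loop version: each qualifying pair is
    yielded as soon as it is found, so a consumer can pull results one
    at a time instead of waiting for the whole join to finish.
    """
    for a in left:
        for b in right:
            if dist(a, b) < eps:
                yield (a, b)


# Illustrative data (assumed, not taken from the paper).
left = [(0.0, 0.0), (5.0, 5.0)]
right = [(0.5, 0.0), (5.0, 5.4), (9.0, 9.0)]

# Pull the first result on demand, the way an iterator-based engine would.
first_pair = next(similarity_join(left, right, eps=1.0))

# Or materialize all results.
pairs = list(similarity_join(left, right, eps=1.0))
```

Because the distance function is a parameter, the same sketch works for other data types, provided a suitable metric is supplied. A nested loop compares every pair, so it serves only as a baseline for understanding what the operator must compute.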

Original language: English (US)
Article number: 1011
Pages (from-to): 149-162
Number of pages: 14
Journal: Information Systems
Volume: 52
DOI: 10.1016/j.is.2015.01.008
State: Published - Jun 1 2015

Fingerprint

Mathematical operators
Engines

Keywords

  • Database operator
  • PostgreSQL
  • Query processing and optimization
  • Similarity Join
  • Similarity queries

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Similarity joins: Their implementation and interactions with other database operators. / Silva, Yasin; Pearson, Spencer S.; Chon, Jaime; Roberts, Ryan.

In: Information Systems, Vol. 52, 1011, 01.06.2015, p. 149-162.

@article{a7a118268c824708b5b374a0832dbd02,
title = "Similarity joins: Their implementation and interactions with other database operators",
keywords = "Database operator, PostgreSQL, Query processing and optimization, Similarity Join, Similarity queries",
author = "Yasin Silva and Pearson, {Spencer S.} and Jaime Chon and Ryan Roberts",
year = "2015",
month = "6",
day = "1",
doi = "10.1016/j.is.2015.01.008",
language = "English (US)",
volume = "52",
pages = "149--162",
journal = "Information Systems",
issn = "0306-4379",
publisher = "Elsevier Limited",

}

TY - JOUR

T1 - Similarity joins

T2 - Their implementation and interactions with other database operators

AU - Silva, Yasin

AU - Pearson, Spencer S.

AU - Chon, Jaime

AU - Roberts, Ryan

PY - 2015/6/1

Y1 - 2015/6/1

KW - Database operator

KW - PostgreSQL

KW - Query processing and optimization

KW - Similarity Join

KW - Similarity queries

UR - http://www.scopus.com/inward/record.url?scp=84930177364&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84930177364&partnerID=8YFLogxK

U2 - 10.1016/j.is.2015.01.008

DO - 10.1016/j.is.2015.01.008

M3 - Article

VL - 52

SP - 149

EP - 162

JO - Information Systems

JF - Information Systems

SN - 0306-4379

M1 - 1011

ER -