Exploiting database similarity joins for metric spaces

Yasin Silva, Spencer Pearson

Research output: Chapter in Book/Report/Conference proceeding › Chapter

13 Citations (Scopus)

Abstract

Similarity Joins are recognized among the most useful data processing and analysis operations and are extensively used in multiple application domains. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. Multiple Similarity Join algorithms and implementation techniques have been proposed. They range from out-of-database approaches that handle only in-memory and external-memory data to techniques that use standard database operators to answer similarity joins. Recent work has shown that this operation can be efficiently implemented as a physical database operator. However, the proposed operator only supports 1D numeric data. This paper presents DBSimJoin, a physical Similarity Join database operator for datasets that lie in any metric space. DBSimJoin is a non-blocking operator that prioritizes the early generation of results. We implemented the proposed operator in PostgreSQL, an open source database system. We show how this operator can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. In particular, we show the use of DBSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show that DBSimJoin scales very well when important parameters (e.g., ε and data size) increase.
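To make the operation defined in the abstract concrete, the following is a minimal sketch of a similarity join as a plain nested-loop comparison over all pairs. It is not the DBSimJoin algorithm, whose internals are not described here; it only illustrates the result the operator must produce. The function names `naive_similarity_join` and `edit_distance` and the sample data are hypothetical, chosen to mirror the two scenarios mentioned (feature vectors and publication titles).

```python
from itertools import combinations
from math import dist
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def naive_similarity_join(
    items: Iterable[T],
    distance: Callable[[T, T], float],
    eps: float,
) -> List[Tuple[T, T]]:
    """Return every unordered pair (a, b) whose distance is smaller than eps.

    Illustrative quadratic baseline only; works with any distance function,
    which is what "any metric space" requires of the interface.
    """
    return [(a, b) for a, b in combinations(list(items), 2) if distance(a, b) < eps]


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, a metric on strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical feature vectors compared with Euclidean distance.
vectors = [(0.0, 0.0), (0.3, 0.4), (5.0, 5.0)]
vector_pairs = naive_similarity_join(vectors, dist, eps=1.0)

# Hypothetical publication titles compared with edit distance.
titles = ["similarity join", "similarity joins", "metric space"]
title_pairs = naive_similarity_join(titles, edit_distance, eps=2)
```

Because the join takes the distance function as a parameter, the same code handles both data types. A physical database operator would be expected to avoid comparing every pair, which this baseline does not attempt.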

Original language: English (US)
Title of host publication: Proceedings of the VLDB Endowment
Pages: 1922-1925
Number of pages: 4
Volume: 5
Edition: 12
State: Published - Aug 2012

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Silva, Y., & Pearson, S. (2012). Exploiting database similarity joins for metric spaces. In Proceedings of the VLDB Endowment (12 ed., Vol. 5, pp. 1922-1925).

@inbook{e669b7a6f79e42649f3765e90e0892e3,
title = "Exploiting database similarity joins for metric spaces",
author = "Yasin Silva and Spencer Pearson",
year = "2012",
month = "8",
language = "English (US)",
volume = "5",
pages = "1922--1925",
booktitle = "Proceedings of the VLDB Endowment",
edition = "12",

}

TY - CHAP

T1 - Exploiting database similarity joins for metric spaces

AU - Silva, Yasin

AU - Pearson, Spencer

PY - 2012/8

Y1 - 2012/8

UR - http://www.scopus.com/inward/record.url?scp=84873143143&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84873143143&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:84873143143

VL - 5

SP - 1922

EP - 1925

BT - Proceedings of the VLDB Endowment

ER -