Abstract
Given an appropriate similarity model, the k-nearest neighbor (k-NN) similarity join is a useful yet costly operator for data mining, data analysis, and data exploration applications. The time to evaluate the operator depends on the size of the datasets, the data distribution, and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems, Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision, and execution time.
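To make the pivot idea in the abstract concrete, the following is a minimal single-machine sketch, not the paper's algorithms. It assumes Euclidean distance, pivots drawn at random from S, and a join restricted to objects that share the same nearest pivot. The names `pivot_knn_join`, `nearest_pivot`, and `euclidean` are hypothetical and chosen only for illustration. The grouping step plays the role of a map phase and the per-cell join the role of a reduce phase.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_pivot(obj, pivots, dist):
    """Index of the pivot closest to obj (ties go to the lowest index)."""
    return min(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))


def pivot_knn_join(R, S, k, num_pivots, dist=euclidean, seed=0):
    """Approximate k-NN join: for each object in R, return up to k indices
    of objects in S, searching only the cell of R's nearest pivot.

    Illustrative assumption: pivots are sampled uniformly at random from S.
    The result carries no approximation guarantee.
    """
    rng = random.Random(seed)
    pivots = rng.sample(S, num_pivots)

    # "Map" step: assign every object to the cell of its nearest pivot.
    cells_r = defaultdict(list)
    cells_s = defaultdict(list)
    for i, r in enumerate(R):
        cells_r[nearest_pivot(r, pivots, dist)].append(i)
    for j, s in enumerate(S):
        cells_s[nearest_pivot(s, pivots, dist)].append(j)

    # "Reduce" step: each cell is joined independently of the others.
    result = {}
    for cell, r_ids in cells_r.items():
        s_ids = cells_s.get(cell, [])
        for i in r_ids:
            ranked = sorted(s_ids, key=lambda j: dist(R[i], S[j]))
            result[i] = ranked[:k]
    return result


if __name__ == "__main__":
    R = [(0.0, 0.0), (10.0, 10.0)]
    S = [(0.0, 1.0), (5.0, 5.0), (10.0, 9.0), (0.0, 3.0)]
    # With a single pivot there is one cell, so the join is exact.
    print(pivot_knn_join(R, S, k=2, num_pivots=1))
```

With more pivots the cells shrink and each one can be processed independently, which is what makes the scheme easy to distribute. The cost is that true neighbors lying across a cell boundary are missed, so precision drops unless neighboring cells are also searched.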
Original language | English (US) |
---|---|
Article number | 101410 |
Journal | Information Systems |
Volume | 87 |
DOIs | https://doi.org/10.1016/j.is.2019.06.006 |
State | Published - Jan 1 2020 |
Keywords
- Approximate similarity join
- Hadoop
- High-dimensional data
- k-NN
- MapReduce
- Spark
ASJC Scopus subject areas
- Software
- Information Systems
- Hardware and Architecture
Cite this
Pivot-based approximate k-NN similarity joins for big high-dimensional data. / Čech, Přemysl; Lokoč, J.; Silva, Yasin.
In: Information Systems, Vol. 87, 101410, 01.01.2020.
Research output: Contribution to journal › Article
TY - JOUR
T1 - Pivot-based approximate k-NN similarity joins for big high-dimensional data
AU - Čech, Přemysl
AU - Lokoč, J.
AU - Silva, Yasin
PY - 2020/1/1
Y1 - 2020/1/1
N2 - Given an appropriate similarity model, the k-nearest neighbor (k-NN) similarity join is a useful yet costly operator for data mining, data analysis, and data exploration applications. The time to evaluate the operator depends on the size of the datasets, the data distribution, and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems, Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision, and execution time.
KW - Approximate similarity join
KW - Hadoop
KW - High-dimensional data
KW - k-NN
KW - MapReduce
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85070216906&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070216906&partnerID=8YFLogxK
U2 - 10.1016/j.is.2019.06.006
DO - 10.1016/j.is.2019.06.006
M3 - Article
AN - SCOPUS:85070216906
VL - 87
JO - Information Systems
JF - Information Systems
SN - 0306-4379
M1 - 101410
ER -