Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data

Přemysl Čech, Jakub Maroušek, Jakub Lokoč, Yasin Silva, Jeremy Starks

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce have mainly focused on the processing of vector data with fewer than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.

Original language: English (US)
Title of host publication: Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings
Publisher: Springer Verlag
Pages: 63-75
Number of pages: 13
Volume: 10604 LNAI
ISBN (Print): 9783319691787
DOI: https://doi.org/10.1007/978-3-319-69179-4_5
State: Published - 2017
Event: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017 - Singapore, Singapore
Duration: Nov 5 2017 to Nov 6 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10604 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017
Country: Singapore
City: Singapore
Period: 11/5/17 to 11/6/17

Keywords

  • Approximate similarity join
  • Hadoop
  • HTTPS data
  • K-NN
  • MapReduce

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Čech, P., Maroušek, J., Lokoč, J., Silva, Y., & Starks, J. (2017). Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data. In Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings (Vol. 10604 LNAI, pp. 63-75). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10604 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-69179-4_5

@inproceedings{44c8640190ea45d081f47df5f51795f6,
title = "Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data",
abstract = "Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of vector data with less than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.",
keywords = "Approximate similarity join, Hadoop, HTTPS data, K-NN, MapReduce",
author = "Přemysl Čech and Jakub Maroušek and Jakub Lokoč and Yasin Silva and Jeremy Starks",
year = "2017",
doi = "10.1007/978-3-319-69179-4_5",
language = "English (US)",
isbn = "9783319691787",
volume = "10604 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "63--75",
booktitle = "Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings",
address = "Germany",
}

TY  - GEN
T1  - Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data
AU  - Čech, Přemysl
AU  - Maroušek, Jakub
AU  - Lokoč, Jakub
AU  - Silva, Yasin
AU  - Starks, Jeremy
PY  - 2017
Y1  - 2017
N2  - Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of vector data with less than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.
AB  - Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of vector data with less than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.
KW  - Approximate similarity join
KW  - Hadoop
KW  - HTTPS data
KW  - K-NN
KW  - MapReduce
UR  - http://www.scopus.com/inward/record.url?scp=85033707567&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85033707567&partnerID=8YFLogxK
U2  - 10.1007/978-3-319-69179-4_5
DO  - 10.1007/978-3-319-69179-4_5
M3  - Conference contribution
SN  - 9783319691787
VL  - 10604 LNAI
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 63
EP  - 75
BT  - Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings
PB  - Springer Verlag
ER  -