Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data

Přemysl Čech; Jakub Maroušek; Jakub Lokoč; Yasin Silva; Jeremy Starks

doi:10.1007/978-3-319-69179-4_5

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

Similarity joins are a useful operator for data mining, data analysis, and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches such as MapReduce are required. So far, state-of-the-art MapReduce-based similarity join approaches have mainly focused on processing vector data with fewer than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.

Original language	English (US)
Title of host publication	Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings
Editors	Wen-Chih Peng, Wei Emma Zhang, Gao Cong, Aixin Sun, Chengliang Li
Publisher	Springer Verlag
Pages	63-75
Number of pages	13
ISBN (Print)	9783319691787
DOIs	https://doi.org/10.1007/978-3-319-69179-4_5
State	Published - 2017
Event	13th International Conference on Advanced Data Mining and Applications, ADMA 2017 - Singapore, Singapore; Duration: Nov 5 2017 → Nov 6 2017

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10604 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	13th International Conference on Advanced Data Mining and Applications, ADMA 2017
Country/Territory	Singapore
City	Singapore
Period	11/5/17 → 11/6/17

Keywords

Approximate similarity join
HTTPS data
Hadoop
K-NN
MapReduce

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-319-69179-4_5

Cite this

Čech, P., Maroušek, J., Lokoč, J., Silva, Y., & Starks, J. (2017). Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data. In W.-C. Peng, W. E. Zhang, G. Cong, A. Sun, & C. Li (Eds.), Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings (pp. 63-75). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10604 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-69179-4_5

@inproceedings{44c8640190ea45d081f47df5f51795f6,

title = "Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data",

abstract = "Similarity joins are a useful operator for data mining, data analysis, and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches such as MapReduce are required. So far, state-of-the-art MapReduce-based similarity join approaches have mainly focused on processing vector data with fewer than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.",

keywords = "Approximate similarity join, HTTPS data, Hadoop, K-NN, MapReduce",

author = "P{\v r}emysl {\v C}ech and Jakub Marou{\v s}ek and Jakub Loko{\v c} and Yasin Silva and Jeremy Starks",

note = "Funding Information: This project was supported by the GA{\v C}R 15-08916S and GAUK. Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017. 13th International Conference on Advanced Data Mining and Applications, ADMA 2017; Conference date: 05-11-2017 through 06-11-2017",

year = "2017",

doi = "10.1007/978-3-319-69179-4_5",

language = "English (US)",

isbn = "9783319691787",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "63--75",

editor = "Wen-Chih Peng and Zhang, {Wei Emma} and Gao Cong and Aixin Sun and Chengliang Li",

booktitle = "Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings",

}

TY - GEN

T1 - Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data

AU - Čech, Přemysl

AU - Maroušek, Jakub

AU - Lokoč, Jakub

AU - Silva, Yasin

AU - Starks, Jeremy

N1 - Funding Information: This project was supported by the GAČR 15-08916S and GAUK. Publisher Copyright: © Springer International Publishing AG 2017.

PY - 2017

Y1 - 2017

AB - Similarity joins are a useful operator for data mining, data analysis, and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches such as MapReduce are required. So far, state-of-the-art MapReduce-based similarity join approaches have mainly focused on processing vector data with fewer than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.

KW - Approximate similarity join

KW - HTTPS data

KW - Hadoop

KW - K-NN

KW - MapReduce

UR - http://www.scopus.com/inward/record.url?scp=85033707567&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85033707567&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-69179-4_5

DO - 10.1007/978-3-319-69179-4_5

M3 - Conference contribution

AN - SCOPUS:85033707567

SN - 9783319691787

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 63

EP - 75

BT - Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings

A2 - Peng, Wen-Chih

A2 - Zhang, Wei Emma

A2 - Cong, Gao

A2 - Sun, Aixin

A2 - Li, Chengliang

PB - Springer Verlag

T2 - 13th International Conference on Advanced Data Mining and Applications, ADMA 2017

Y2 - 5 November 2017 through 6 November 2017

ER -
