An experimental survey of MapReduce-based similarity joins

Yasin Silva; Jason Reed; Kyle Brown; Adelbert Wadsworth; Chuitian Rong

doi:10.1007/978-3-319-46759-7_14

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel while others were not implemented even as part of their original publications. Consequently, there is not a clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.

Original language	English (US)
Title of host publication	Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings
Publisher	Springer Verlag
Pages	181-195
Number of pages	15
Volume	9939 LNCS
ISBN (Print)	9783319467580
DOIs	https://doi.org/10.1007/978-3-319-46759-7_14
State	Published - 2016
Event	9th International Conference on Similarity Search and Applications, SISAP 2016 - Tokyo, Japan
Duration	Oct 24 2016 → Oct 26 2016

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	9939 LNCS
ISSN (Print)	03029743
ISSN (Electronic)	16113349

Other

Other	9th International Conference on Similarity Search and Applications, SISAP 2016
Country/Territory	Japan
City	Tokyo
Period	10/24/16 → 10/26/16

Keywords

Big data systems
MapReduce
Performance evaluation
Similarity joins

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-319-46759-7_14

Cite this

Silva, Y., Reed, J., Brown, K., Wadsworth, A., & Rong, C. (2016). An experimental survey of MapReduce-based similarity joins. In Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings (Vol. 9939 LNCS, pp. 181-195). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9939 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-46759-7_14

An experimental survey of MapReduce-based similarity joins. / Silva, Yasin; Reed, Jason; Brown, Kyle et al.
Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings. Vol. 9939 LNCS Springer Verlag, 2016. p. 181-195 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9939 LNCS).

Silva, Y, Reed, J, Brown, K, Wadsworth, A & Rong, C 2016, An experimental survey of MapReduce-based similarity joins. in Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings. vol. 9939 LNCS, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9939 LNCS, Springer Verlag, pp. 181-195, 9th International Conference on Similarity Search and Applications, SISAP 2016, Tokyo, Japan, 10/24/16. https://doi.org/10.1007/978-3-319-46759-7_14

Silva Y, Reed J, Brown K, Wadsworth A, Rong C. An experimental survey of MapReduce-based similarity joins. In Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings. Vol. 9939 LNCS. Springer Verlag. 2016. p. 181-195. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-46759-7_14

Silva, Yasin ; Reed, Jason ; Brown, Kyle et al. / An experimental survey of MapReduce-based similarity joins. Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings. Vol. 9939 LNCS Springer Verlag, 2016. pp. 181-195 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{ca3b0860da2640ca9272407cf29b2afd,

title = "An experimental survey of {MapReduce}-based similarity joins",

abstract = "In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel while others were not implemented even as part of their original publications. Consequently, there is not a clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.",

keywords = "Big data systems, MapReduce, Performance evaluation, Similarity joins",

author = "Yasin Silva and Jason Reed and Kyle Brown and Adelbert Wadsworth and Chuitian Rong",

year = "2016",

doi = "10.1007/978-3-319-46759-7_14",

language = "English (US)",

isbn = "9783319467580",

volume = "9939 LNCS",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "181--195",

booktitle = "Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings",

address = "Germany",

note = "9th International Conference on Similarity Search and Applications, SISAP 2016 ; Conference date: 24-10-2016 Through 26-10-2016",

}

TY - GEN

T1 - An experimental survey of MapReduce-based similarity joins

AU - Silva, Yasin

AU - Reed, Jason

AU - Brown, Kyle

AU - Wadsworth, Adelbert

AU - Rong, Chuitian

PY - 2016

Y1 - 2016

N2 - In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel while others were not implemented even as part of their original publications. Consequently, there is not a clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.

AB - In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel while others were not implemented even as part of their original publications. Consequently, there is not a clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.

KW - Big data systems

KW - MapReduce

KW - Performance evaluation

KW - Similarity joins

UR - http://www.scopus.com/inward/record.url?scp=84989825265&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84989825265&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-46759-7_14

DO - 10.1007/978-3-319-46759-7_14

M3 - Conference contribution

AN - SCOPUS:84989825265

SN - 9783319467580

VL - 9939 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 181

EP - 195

BT - Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings

PB - Springer Verlag

T2 - 9th International Conference on Similarity Search and Applications, SISAP 2016

Y2 - 24 October 2016 through 26 October 2016

ER -
