An experimental survey of MapReduce-based similarity joins

Yasin Silva, Jason Reed, Kyle Brown, Adelbert Wadsworth, Chuitian Rong

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel, while others were not implemented even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification, and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
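To make the Similarity Join definition above concrete, here is a minimal sketch in Python. It is not one of the algorithms evaluated in the paper. It assumes one-dimensional integer records, absolute difference as the distance function, and a threshold named `eps`; all function names are hypothetical. The first function is a naive all-pairs baseline. The others simulate a MapReduce-style join in memory: the map step assigns records to cells of width `eps` (replicating records from the second dataset to neighboring cells), and the reduce step compares records only within a cell.

```python
from collections import defaultdict
from math import floor


def naive_similarity_join(r, s, eps):
    """Baseline: compare every pair and keep those within distance eps."""
    return sorted((a, b) for a in r for b in s if abs(a - b) <= eps)


def map_phase(r, s, eps):
    """Emit (cell, (tag, value)) pairs; cells are intervals of width eps."""
    for a in r:
        yield floor(a / eps), ("R", a)
    for b in s:
        cell = floor(b / eps)
        # Replicate each S record to the adjacent cells so that no
        # qualifying pair is split across two reducers.
        for c in (cell - 1, cell, cell + 1):
            yield c, ("S", b)


def reduce_phase(records, eps):
    """Join the R and S records that landed in the same cell."""
    r_vals = [v for tag, v in records if tag == "R"]
    s_vals = [v for tag, v in records if tag == "S"]
    return [(a, b) for a in r_vals for b in s_vals if abs(a - b) <= eps]


def mapreduce_similarity_join(r, s, eps):
    """In-memory simulation of the map, shuffle, and reduce steps."""
    groups = defaultdict(list)  # stands in for the shuffle step
    for key, value in map_phase(r, s, eps):
        groups[key].append(value)
    pairs = []
    for records in groups.values():
        pairs.extend(reduce_phase(records, eps))
    return sorted(pairs)


if __name__ == "__main__":
    R = [1, 5, 10, 20]
    S = [2, 7, 11, 30]
    print(mapreduce_similarity_join(R, S, eps=2))
```

Because each record of the first dataset lands in exactly one cell, every qualifying pair is produced once, so no deduplication step is needed in this sketch. Real MapReduce-based SJ techniques differ in how they partition data and which data types and distance functions they support, which is the subject of the paper's comparison.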

Original language: English (US)
Title of host publication: Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings
Publisher: Springer Verlag
Pages: 181-195
Number of pages: 15
Volume: 9939 LNCS
ISBN (Print): 9783319467580
DOI: https://doi.org/10.1007/978-3-319-46759-7_14
State: Published - 2016
Event: 9th International Conference on Similarity Search and Applications, SISAP 2016 - Tokyo, Japan
Duration: Oct 24 2016 to Oct 26 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9939 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 9th International Conference on Similarity Search and Applications, SISAP 2016
Country: Japan
City: Tokyo
Period: 10/24/16 to 10/26/16

Keywords

  • Big data systems
  • MapReduce
  • Performance evaluation
  • Similarity joins

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Silva, Y., Reed, J., Brown, K., Wadsworth, A., & Rong, C. (2016). An experimental survey of MapReduce-based similarity joins. In Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings (Vol. 9939 LNCS, pp. 181-195). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9939 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-46759-7_14

@inproceedings{ca3b0860da2640ca9272407cf29b2afd,
title = "An experimental survey of MapReduce-based similarity joins",
keywords = "Big data systems, MapReduce, Performance evaluation, Similarity joins",
author = "Yasin Silva and Jason Reed and Kyle Brown and Adelbert Wadsworth and Chuitian Rong",
year = "2016",
doi = "10.1007/978-3-319-46759-7_14",
language = "English (US)",
isbn = "9783319467580",
volume = "9939 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "181--195",
booktitle = "Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings",
address = "Germany",

}

TY - GEN

T1 - An experimental survey of MapReduce-based similarity joins

AU - Silva, Yasin

AU - Reed, Jason

AU - Brown, Kyle

AU - Wadsworth, Adelbert

AU - Rong, Chuitian

PY - 2016

Y1 - 2016

KW - Big data systems

KW - MapReduce

KW - Performance evaluation

KW - Similarity joins

UR - http://www.scopus.com/inward/record.url?scp=84989825265&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84989825265&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-46759-7_14

DO - 10.1007/978-3-319-46759-7_14

M3 - Conference contribution

AN - SCOPUS:84989825265

SN - 9783319467580

VL - 9939 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 181

EP - 195

BT - Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings

PB - Springer Verlag

ER -