TY - GEN
T1 - An experimental survey of mapreduce-based similarity joins
AU - Silva, Yasin
AU - Reed, Jason
AU - Brown, Kyle
AU - Wadsworth, Adelbert
AU - Rong, Chuitian
PY - 2016
Y1 - 2016
N2 - In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel while others were not implemented even as part of their original publications. Consequently, there is not a clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
AB - In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches. This was the case in part because some of these techniques were developed in parallel while others were not implemented even as part of their original publications. Consequently, there is not a clear understanding of how these techniques compare to each other and which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
KW - Big data systems
KW - MapReduce
KW - Performance evaluation
KW - Similarity joins
UR - http://www.scopus.com/inward/record.url?scp=84989825265&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84989825265&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-46759-7_14
DO - 10.1007/978-3-319-46759-7_14
M3 - Conference contribution
AN - SCOPUS:84989825265
SN - 9783319467580
VL - 9939 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 181
EP - 195
BT - Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings
PB - Springer Verlag
T2 - 9th International Conference on Similarity Search and Applications, SISAP 2016
Y2 - 24 October 2016 through 26 October 2016
ER -