MapReduce-based similarity join for metric spaces

Yasin Silva, Jason M. Reed, Lisa M. Tsosie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation techniques of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm partitions and distributes the data until the subsets are small enough to be processed in a single node. MRSimJoin is general enough to be used with data that lies in any metric space, so it can be used with multiple data types and distance functions. We present guidelines for implementing the algorithm in Hadoop, an open-source cloud system. The experimental evaluation of MRSimJoin shows that it has very good execution time and scalability properties.
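To make the operation described in the abstract concrete, the following is a minimal single-machine sketch in Python. It is not the MRSimJoin algorithm and not the authors' Hadoop implementation. `naive_similarity_join` restates the definition (all pairs with distance below ε). `partitioned_similarity_join` illustrates one round of pivot-based partitioning under the assumption that the distance function obeys the triangle inequality. The function names, the choice of pivots, and the `2 * eps` replication rule are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations
from typing import Callable, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]
Pairs = Set[Tuple[int, int]]


def naive_similarity_join(records: Sequence[T], dist: Distance, eps: float) -> Pairs:
    """Definition of the join: index pairs (i, j), i < j, with distance < eps."""
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if dist(records[i], records[j]) < eps
    }


def partitioned_similarity_join(
    records: Sequence[T], pivots: Sequence[T], dist: Distance, eps: float
) -> Pairs:
    """One round of pivot-based partitioning (hypothetical simplification).

    Each record goes to the bucket of its closest pivot (its "home") and is
    also copied into every bucket k with d(r, p_k) - d(r, p_home) <= 2 * eps.
    By the triangle inequality, any two records closer than eps then share
    at least one bucket, so joining each bucket independently is sufficient.
    """
    buckets = {k: [] for k in range(len(pivots))}
    for idx, rec in enumerate(records):
        d = [dist(rec, p) for p in pivots]
        home = min(range(len(pivots)), key=d.__getitem__)
        for k in range(len(pivots)):
            if k == home or d[k] - d[home] <= 2 * eps:
                buckets[k].append(idx)

    result: Pairs = set()
    for members in buckets.values():
        # Each bucket is small enough to join locally; the set removes
        # pairs that were found in more than one bucket.
        result |= {
            (i, j)
            for i, j in combinations(members, 2)
            if dist(records[i], records[j]) < eps
        }
    return result


if __name__ == "__main__":
    values = [1, 2, 10, 11, 30]

    def abs_diff(a: int, b: int) -> int:
        return abs(a - b)

    print(sorted(naive_similarity_join(values, abs_diff, 3)))
    print(sorted(partitioned_similarity_join(values, [1, 30], abs_diff, 3)))
```

Because both functions only call `dist`, the same sketch applies to any data type with a metric, for example strings under edit distance. In a MapReduce setting the bucket assignment would roughly correspond to the map phase and the per-bucket join to the reduce phase, but that mapping is an assumption of this sketch rather than a description of the paper's design.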

Original language: English (US)
Title of host publication: ACM International Conference Proceeding Series
DOIs: https://doi.org/10.1145/2347673.2347676
State: Published - 2012
Event: 1st International Workshop on Cloud Intelligence, Cloud-I 2012 - Istanbul, Turkey
Duration: Aug 31 2012 → Aug 31 2012

Other

Other: 1st International Workshop on Cloud Intelligence, Cloud-I 2012
Country: Turkey
City: Istanbul
Period: 8/31/12 → 8/31/12

Fingerprint

Scalability

Keywords

  • Hadoop
  • MapReduce
  • Metric space
  • Similarity join

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Silva, Y., Reed, J. M., & Tsosie, L. M. (2012). MapReduce-based similarity join for metric spaces. In ACM International Conference Proceeding Series [2347676] https://doi.org/10.1145/2347673.2347676

@inproceedings{074f6d27052643c9b7f9f5cd4d3c5305,
title = "MapReduce-based similarity join for metric spaces",
keywords = "Hadoop, MapReduce, Metric space, Similarity join",
author = "Silva, Yasin and Reed, {Jason M.} and Tsosie, {Lisa M.}",
year = "2012",
doi = "10.1145/2347673.2347676",
language = "English (US)",
isbn = "9781450315968",
booktitle = "ACM International Conference Proceeding Series",

}

TY - GEN

T1 - MapReduce-based similarity join for metric spaces

AU - Silva, Yasin

AU - Reed, Jason M.

AU - Tsosie, Lisa M.

PY - 2012

Y1 - 2012

KW - Hadoop

KW - MapReduce

KW - Metric space

KW - Similarity join

UR - http://www.scopus.com/inward/record.url?scp=84866879166&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84866879166&partnerID=8YFLogxK

U2 - 10.1145/2347673.2347676

DO - 10.1145/2347673.2347676

M3 - Conference contribution

AN - SCOPUS:84866879166

SN - 9781450315968

BT - ACM International Conference Proceeding Series

ER -