Exploiting MapReduce-based similarity joins

Yasin Silva; Jason M. Reed

doi:10.1145/2213836.2213935

Exploiting MapReduce-based similarity joins

Yasin Silva, Jason M. Reed

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

42 Scopus citations

Abstract

Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a pre-defined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed the study of Similarity Joins for cloud systems. This paper presents MRSimJoin, a multi-round MapReduce based algorithm to efficiently solve the Similarity Join problem. MRSimJoin efficiently partitions and distributes the data until the subsets are small enough to be processed in a single node. The proposed algorithm is general enough to be used with data that lies in any metric space. We have implemented MRSimJoin in Hadoop, a highly used open-source cloud system. We show how this operation can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. Particularly, we show the use of MRSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show how MRSimJoin scales in each scenario when important parameters, e.g., ε, data size and number of cluster nodes, increase. We demonstrate the execution of MRSimJoin queries using an Amazon Elastic Compute Cloud (EC2) cluster.

Original language	English (US)
Title of host publication	SIGMOD '12 - Proceedings of the International Conference on Management of Data
Pages	693-696
Number of pages	4
DOIs	https://doi.org/10.1145/2213836.2213935
State	Published - 2012
Event	2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12 - Scottsdale, AZ, United States Duration: May 21 2012 → May 24 2012

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12
Country/Territory	United States
City	Scottsdale, AZ
Period	5/21/12 → 5/24/12

Keywords

hadoop
mapreduce
similarity join

ASJC Scopus subject areas

Software
Information Systems

Access to Document

10.1145/2213836.2213935

Cite this

Silva, Y & Reed, JM 2012, Exploiting MapReduce-based similarity joins. in SIGMOD '12 - Proceedings of the International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 693-696, 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, Scottsdale, AZ, United States, 5/21/12. https://doi.org/10.1145/2213836.2213935

@inproceedings{ed93146487a34ca3a646d5c1807c2722,

title = "Exploiting MapReduce-based similarity joins",

abstract = "Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a pre-defined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed the study of Similarity Joins for cloud systems. This paper presents MRSimJoin, a multi-round MapReduce based algorithm to efficiently solve the Similarity Join problem. MRSimJoin efficiently partitions and distributes the data until the subsets are small enough to be processed in a single node. The proposed algorithm is general enough to be used with data that lies in any metric space. We have implemented MRSimJoin in Hadoop, a highly used open-source cloud system. We show how this operation can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. Particularly, we show the use of MRSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show how MRSimJoin scales in each scenario when important parameters, e.g., ε, data size and number of cluster nodes, increase. We demonstrate the execution of MRSimJoin queries using an Amazon Elastic Compute Cloud (EC2) cluster.",

keywords = "hadoop, mapreduce, similarity join",

author = "Yasin Silva and Reed, {Jason M.}",

year = "2012",

doi = "10.1145/2213836.2213935",

language = "English (US)",

isbn = "9781450312479",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

pages = "693--696",

booktitle = "SIGMOD '12 - Proceedings of the International Conference on Management of Data",

note = "2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12 ; Conference date: 21-05-2012 Through 24-05-2012",

}

TY - GEN

T1 - Exploiting MapReduce-based similarity joins

AU - Silva, Yasin

AU - Reed, Jason M.

PY - 2012

Y1 - 2012

N2 - Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a pre-defined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed the study of Similarity Joins for cloud systems. This paper presents MRSimJoin, a multi-round MapReduce based algorithm to efficiently solve the Similarity Join problem. MRSimJoin efficiently partitions and distributes the data until the subsets are small enough to be processed in a single node. The proposed algorithm is general enough to be used with data that lies in any metric space. We have implemented MRSimJoin in Hadoop, a highly used open-source cloud system. We show how this operation can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. Particularly, we show the use of MRSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show how MRSimJoin scales in each scenario when important parameters, e.g., ε, data size and number of cluster nodes, increase. We demonstrate the execution of MRSimJoin queries using an Amazon Elastic Compute Cloud (EC2) cluster.

AB - Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a pre-defined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed the study of Similarity Joins for cloud systems. This paper presents MRSimJoin, a multi-round MapReduce based algorithm to efficiently solve the Similarity Join problem. MRSimJoin efficiently partitions and distributes the data until the subsets are small enough to be processed in a single node. The proposed algorithm is general enough to be used with data that lies in any metric space. We have implemented MRSimJoin in Hadoop, a highly used open-source cloud system. We show how this operation can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. Particularly, we show the use of MRSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show how MRSimJoin scales in each scenario when important parameters, e.g., ε, data size and number of cluster nodes, increase. We demonstrate the execution of MRSimJoin queries using an Amazon Elastic Compute Cloud (EC2) cluster.

KW - hadoop

KW - mapreduce

KW - similarity join

UR - http://www.scopus.com/inward/record.url?scp=84862666441&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862666441&partnerID=8YFLogxK

U2 - 10.1145/2213836.2213935

DO - 10.1145/2213836.2213935

M3 - Conference contribution

AN - SCOPUS:84862666441

SN - 9781450312479

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 693

EP - 696

BT - SIGMOD '12 - Proceedings of the International Conference on Management of Data

T2 - 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12

Y2 - 21 May 2012 through 24 May 2012

ER -

Exploiting MapReduce-based similarity joins

Abstract

Publication series

Conference

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Cite this