Similarity Grouping in Big Data Systems

Yasin N. Silva; Manuel Sandoval; Diana Prado; Xavier Wallace; Chuitian Rong

doi:10.1007/978-3-030-32047-8_19

School of Mathematical and Natural Sciences (SMNS)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups in which all elements are within a given threshold of each other. This paper presents DSG’s design details, implementation guidelines for Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.
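
The grouping semantics described in the abstract (every pair of elements in a group lies within a given threshold) can be illustrated with a small single-node sketch. This is not the DSG algorithm from the paper, whose design is not given in this record; the function names, the greedy assignment strategy, and the use of Euclidean distance are all assumptions made only for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def is_similarity_group(points, eps):
    """Return True if every pair of points is within eps of each other."""
    return all(dist(p, q) <= eps for p, q in combinations(points, 2))


def greedy_similarity_groups(points, eps):
    """Hypothetical greedy pass (an assumption, not the paper's method):
    place each point in the first group whose members are ALL within eps
    of it, or start a new group if no such group exists."""
    groups = []
    for p in points:
        for g in groups:
            if all(dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1)]
groups = greedy_similarity_groups(points, eps=0.6)
for g in groups:
    print(g, is_similarity_group(g, 0.6))
```

In this example (1.0, 0.0) is within the threshold of (0.5, 0.0) but not of (0.0, 0.0), so it cannot join their group: the all-pairs condition is stricter than chaining neighbors together.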

Original language	English (US)
Title of host publication	Similarity Search and Applications - 12th International Conference, SISAP 2019, Proceedings
Editors	Giuseppe Amato, Claudio Gennaro, Vincent Oria, Miloš Radovanovic
Publisher	Springer
Pages	212-220
Number of pages	9
ISBN (Print)	9783030320461
DOIs	https://doi.org/10.1007/978-3-030-32047-8_19
State	Published - 2019
Event	12th International Conference on Similarity Search and Applications, SISAP 2019 - Newark, United States
Duration	Oct 2 2019 → Oct 4 2019

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11807 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	12th International Conference on Similarity Search and Applications, SISAP 2019
Country/Territory	United States
City	Newark
Period	10/2/19 → 10/4/19

Keywords

Big data systems
Clustering
Hadoop
MapReduce
Performance evaluation
Similarity grouping
Spark

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Cite this

Silva, Y. N., Sandoval, M., Prado, D., Wallace, X., & Rong, C. (2019). Similarity Grouping in Big Data Systems. In G. Amato, C. Gennaro, V. Oria, & M. Radovanovic (Eds.), Similarity Search and Applications - 12th International Conference, SISAP 2019, Proceedings (pp. 212-220). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11807 LNCS). Springer. https://doi.org/10.1007/978-3-030-32047-8_19

@inproceedings{5b36703308814e7c97cd93826c4b10e0,
  title = "Similarity Grouping in Big Data Systems",
  abstract = "Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups where all the elements are within a given threshold from each other. This paper presents DSG{\textquoteright}s design details, implementation guidelines on Spark and Hadoop (two important Big Data systems), and extensive performance and scalability evaluation.",
  keywords = "Big data systems, Clustering, Hadoop, MapReduce, Performance evaluation, Similarity grouping, Spark",
  author = "Silva, {Yasin N.} and Manuel Sandoval and Diana Prado and Xavier Wallace and Chuitian Rong",
  note = "Publisher Copyright: {\textcopyright} 2019, Springer Nature Switzerland AG.; 12th International Conference on Similarity Search and Applications, SISAP 2019 ; Conference date: 02-10-2019 Through 04-10-2019",
  year = "2019",
  doi = "10.1007/978-3-030-32047-8_19",
  language = "English (US)",
  isbn = "9783030320461",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer",
  pages = "212--220",
  editor = "Giuseppe Amato and Claudio Gennaro and Vincent Oria and Milo{\v s} Radovanovic",
  booktitle = "Similarity Search and Applications - 12th International Conference, SISAP 2019, Proceedings",
}

TY - GEN
T1 - Similarity Grouping in Big Data Systems
AU - Silva, Yasin N.
AU - Sandoval, Manuel
AU - Prado, Diana
AU - Wallace, Xavier
AU - Rong, Chuitian
PY - 2019
Y1 - 2019

AB - Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups where all the elements are within a given threshold from each other. This paper presents DSG’s design details, implementation guidelines on Spark and Hadoop (two important Big Data systems), and extensive performance and scalability evaluation.

KW - Big data systems
KW - Clustering
KW - Hadoop
KW - MapReduce
KW - Performance evaluation
KW - Similarity grouping
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85076095957&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076095957&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-32047-8_19
DO - 10.1007/978-3-030-32047-8_19
M3 - Conference contribution
AN - SCOPUS:85076095957
SN - 9783030320461
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 212
EP - 220
BT - Similarity Search and Applications - 12th International Conference, SISAP 2019, Proceedings
A2 - Amato, Giuseppe
A2 - Gennaro, Claudio
A2 - Oria, Vincent
A2 - Radovanovic, Miloš
PB - Springer
T2 - 12th International Conference on Similarity Search and Applications, SISAP 2019
Y2 - 2 October 2019 through 4 October 2019
ER -
