Similarity Grouping in Big Data Systems

Yasin N. Silva, Manuel Sandoval, Diana Prado, Xavier Wallace, Chuitian Rong

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups in which all the elements are within a given threshold of each other. This paper presents DSG’s design details, implementation guidelines on Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.

Original language: English (US)
Title of host publication: Similarity Search and Applications - 12th International Conference, SISAP 2019, Proceedings
Editors: Giuseppe Amato, Claudio Gennaro, Vincent Oria, Miloš Radovanovic
Publisher: Springer
Pages: 212-220
Number of pages: 9
ISBN (Print): 9783030320461
DOIs
State: Published - Jan 1 2019
Externally published: Yes
Event: 12th International Conference on Similarity Search and Applications, SISAP 2019 - Newark, United States
Duration: Oct 2 2019 to Oct 4 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11807 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th International Conference on Similarity Search and Applications, SISAP 2019
Country: United States
City: Newark
Period: 10/2/19 to 10/4/19

Keywords

  • Big data systems
  • Clustering
  • Hadoop
  • MapReduce
  • Performance evaluation
  • Similarity grouping
  • Spark

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
