Similarity Group-By operators for multi-dimensional relational data

Mingjie Tang, Ruby Y. Tahboub, Walid G. Aref, Mikhail J. Atallah, Qutaibah M. Malluhi, Mourad Ouzzani, Yasin Silva

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Scopus citations

Abstract

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.

Original languageEnglish (US)
Title of host publication2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1448-1449
Number of pages2
ISBN (Electronic)9781509020195
DOIs
StatePublished - Jun 22 2016
Event32nd IEEE International Conference on Data Engineering, ICDE 2016 - Helsinki, Finland
Duration: May 16 2016May 20 2016

Publication series

Name2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016

Other

Other32nd IEEE International Conference on Data Engineering, ICDE 2016
Country/TerritoryFinland
CityHelsinki
Period5/16/165/20/16

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Computer Graphics and Computer-Aided Design
  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management

Fingerprint

Dive into the research topics of 'Similarity Group-By operators for multi-dimensional relational data'. Together they form a unique fingerprint.

Cite this