Similarity Group-By operators for multi-dimensional relational data

Mingjie Tang; Ruby Y. Tahboub; Walid G. Aref; Mikhail J. Atallah; Qutaibah M. Malluhi; Mourad Ouzzani; Yasin Silva

doi:10.1109/ICDE.2016.7498368

Arizona State University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-By by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and hence groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from at least one other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard Group-By. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude improvement in performance over baseline methods developed to solve the same problem.
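The two grouping semantics described in the abstract can be illustrated with a small sketch. This is a naive quadratic-time illustration in Python and not the algorithms or the PostgreSQL implementation from the paper. The function names `sgb_any` and `sgb_all`, the `on_overlap` parameter and its values, the use of Euclidean distance, and the greedy input-order processing are all assumptions made for this example.

```python
from math import dist  # Euclidean distance (Python 3.8+); the metric is an assumption


def sgb_any(points, eps):
    """Distance-to-any sketch: groups are the connected components of the
    graph that links two tuples whenever they lie within eps of each other."""
    parent = list(range(len(points)))  # union-find forest over tuple indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(sorted(g) for g in groups.values())


def sgb_all(points, eps, on_overlap="join_any"):
    """Distance-to-all (clique) sketch: a tuple may join a group only if it is
    within eps of every member. Tuples are processed greedily in input order,
    so the result depends on that order (a simplification for illustration)."""
    if on_overlap not in ("eliminate", "join_any", "new_group"):
        raise ValueError("unknown on_overlap value: %r" % (on_overlap,))
    groups = []
    for p in points:
        # Groups whose every member is within eps of p.
        candidates = [g for g in groups if all(dist(p, q) <= eps for q in g)]
        if not candidates:
            groups.append([p])          # p starts a group of its own
        elif len(candidates) == 1 or on_overlap == "join_any":
            candidates[0].append(p)     # unique match, or pick one group
        elif on_overlap == "new_group":
            groups.append([p])          # overlapping tuple gets a new group
        # on_overlap == "eliminate": the overlapping tuple is dropped
    return groups


chain = [(0, 0), (1, 0), (2, 0), (10, 10)]
print(sgb_any(chain, 1.0))   # [[(0, 0), (1, 0), (2, 0)], [(10, 10)]]
print(sgb_all(chain, 1.0))   # [[(0, 0), (1, 0)], [(2, 0)], [(10, 10)]]
```

In this toy input the chain (0, 0), (1, 0), (2, 0) forms a single group under distance-to-any because each tuple is within 1.0 of a neighbor, while distance-to-all splits it because (0, 0) and (2, 0) are 2.0 apart. Feeding the tuples in the order (0, 0), (2, 0), (1, 0) makes (1, 0) qualify for two groups, which is where the three overlap options differ.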

Original language	English (US)
Title of host publication	2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1448-1449
Number of pages	2
ISBN (Electronic)	9781509020195
DOIs	https://doi.org/10.1109/ICDE.2016.7498368
State	Published - Jun 22 2016
Event	32nd IEEE International Conference on Data Engineering, ICDE 2016 - Helsinki, Finland Duration: May 16 2016 → May 20 2016

Publication series

Name	2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016

Other

Other	32nd IEEE International Conference on Data Engineering, ICDE 2016
Country/Territory	Finland
City	Helsinki
Period	5/16/16 → 5/20/16

ASJC Scopus subject areas

Artificial Intelligence
Computational Theory and Mathematics
Computer Graphics and Computer-Aided Design
Computer Networks and Communications
Information Systems
Information Systems and Management

Access to Document

10.1109/ICDE.2016.7498368

Cite this

Tang, M., Tahboub, R. Y., Aref, W. G., Atallah, M. J., Malluhi, Q. M., Ouzzani, M., & Silva, Y. (2016). Similarity Group-By operators for multi-dimensional relational data. In 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016 (pp. 1448-1449). Article 7498368 (2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDE.2016.7498368

@inproceedings{ff5df333076a424a8a1ab0d41751c28d,
  title = "Similarity Group-By operators for multi-dimensional relational data",

  author = "Mingjie Tang and Tahboub, {Ruby Y.} and Aref, {Walid G.} and Atallah, {Mikhail J.} and Malluhi, {Qutaibah M.} and Mourad Ouzzani and Yasin Silva",
  note = "Publisher Copyright: {\textcopyright} 2016 IEEE.; 32nd IEEE International Conference on Data Engineering, ICDE 2016 ; Conference date: 16-05-2016 Through 20-05-2016",
  year = "2016",
  month = jun,
  day = "22",
  doi = "10.1109/ICDE.2016.7498368",
  language = "English (US)",
  series = "2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  pages = "1448--1449",
  booktitle = "2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016",
}

TY - GEN
T1 - Similarity Group-By operators for multi-dimensional relational data
AU - Tang, Mingjie
AU - Tahboub, Ruby Y.
AU - Aref, Walid G.
AU - Atallah, Mikhail J.
AU - Malluhi, Qutaibah M.
AU - Ouzzani, Mourad
AU - Silva, Yasin
PY - 2016/6/22
Y1 - 2016/6/22

UR - http://www.scopus.com/inward/record.url?scp=84980325558&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84980325558&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2016.7498368
DO - 10.1109/ICDE.2016.7498368
M3 - Conference contribution
AN - SCOPUS:84980325558
T3 - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
SP - 1448
EP - 1449
BT - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Conference on Data Engineering, ICDE 2016
Y2 - 16 May 2016 through 20 May 2016
ER -
