Parallel time series join using spark

Chuitian Rong; Lili Chen; Yasin N. Silva

doi:10.1002/cpe.5622

Mathematical and Natural Sciences, School of (SMNS)

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

A time series is a sequence of data points in successive temporal order. Time series data is produced in many application scenarios, and techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. The Pearson correlation coefficient, a measure of the correlation between two variables, is used to measure the correlation between two time series because it has several beneficial mathematical properties, for example, invariance with respect to scale and offset. Considering the need to analyze big time series data, we focus on scalable and distributed techniques for processing massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large-scale data processing. Our solution builds on (1) a fast method that computes the fast Fourier transform of the time series to calculate the correlation between two time series, (2) a lossless partition method that divides the time series into multiple subsequences and enables a parallel and correct computation of the join result, and (3) optimization techniques that avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.
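To make the join and the idea of a lossless partition concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fixed subsequence length `m` and a correlation threshold, compares windows by brute force instead of using the fast Fourier transform, and runs on a single machine without Spark. The function names `pearson`, `join`, `partitions`, and `partitioned_join` are hypothetical. The point it illustrates is that consecutive chunks must share `m - 1` points so that no subsequence starting near a chunk boundary is lost.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant window is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def join(s, t, m, threshold, s_offset=0):
    """Brute-force join: every (i, j) whose length-m windows correlate >= threshold."""
    pairs = set()
    for i in range(len(s) - m + 1):
        for j in range(len(t) - m + 1):
            if pearson(s[i:i + m], t[j:j + m]) >= threshold:
                # s_offset maps a chunk-local index back to the full series
                pairs.add((i + s_offset, j))
    return pairs


def partitions(s, m, size):
    """Yield (offset, chunk) pairs; each chunk holds `size` window starts.

    Consecutive chunks overlap by m - 1 points, so every length-m window
    of s lies entirely inside exactly one chunk.
    """
    n_windows = len(s) - m + 1
    for start in range(0, n_windows, size):
        end = min(start + size, n_windows)
        yield start, s[start:end + m - 1]


def partitioned_join(s, t, m, threshold, size):
    """Join each chunk of s against t independently and merge the results."""
    result = set()
    for offset, chunk in partitions(s, m, size):
        result |= join(chunk, t, m, threshold, s_offset=offset)
    return result
```

Because each chunk is processed independently, the loop in `partitioned_join` is the part that a distributed engine could run as separate tasks. The merged result equals the result of the unpartitioned `join`, which is what "lossless" means in this sketch.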

Original language	English (US)
Article number	e5622
Journal	Concurrency Computation Practice and Experience
Volume	32
Issue number	9
DOIs	https://doi.org/10.1002/cpe.5622
State	Published - May 10 2020

Keywords

parallel
partition
spark
time series join

ASJC Scopus subject areas

Theoretical Computer Science
Software
Computer Science Applications
Computer Networks and Communications
Computational Theory and Mathematics

Access to Document

10.1002/cpe.5622

Cite this

@article{dfa9382df1c546a7916fa300412489db,

title = "Parallel time series join using spark",

abstract = "A time series is a sequence of data points in successive temporal order. Time series data is produced in many application scenarios, and techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. The Pearson correlation coefficient, a measure of the correlation between two variables, is used to measure the correlation between two time series because it has several beneficial mathematical properties, for example, invariance with respect to scale and offset. Considering the need to analyze big time series data, we focus on scalable and distributed techniques for processing massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large-scale data processing. Our solution builds on (1) a fast method that computes the fast Fourier transform of the time series to calculate the correlation between two time series, (2) a lossless partition method that divides the time series into multiple subsequences and enables a parallel and correct computation of the join result, and (3) optimization techniques that avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.",

keywords = "parallel, partition, spark, time series join",

author = "Chuitian Rong and Lili Chen and Silva, {Yasin N.}",

note = "Funding Information: This work was supported in part by the National Natural Science Foundation of China under grants 61402329 and 61972456 and in part by the Tianjin Natural Science Foundation under grant 19JCYBJC15400. Publisher Copyright: {\textcopyright} 2019 John Wiley \& Sons, Ltd.",

year = "2020",

month = may,

day = "10",

doi = "10.1002/cpe.5622",

language = "English (US)",

volume = "32",

journal = "Concurrency Computation Practice and Experience",

issn = "1532-0626",

publisher = "John Wiley and Sons Ltd",

number = "9",

}

TY - JOUR

T1 - Parallel time series join using spark

AU - Rong, Chuitian

AU - Chen, Lili

AU - Silva, Yasin N.

N1 - Funding Information: This work was supported in part by the National Natural Science Foundation of China under grants 61402329 and 61972456 and in part by the Tianjin Natural Science Foundation under grant 19JCYBJC15400. Publisher Copyright: © 2019 John Wiley & Sons, Ltd.

PY - 2020/5/10

Y1 - 2020/5/10

AB - A time series is a sequence of data points in successive temporal order. Time series data is produced in many application scenarios, and techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. The Pearson correlation coefficient, a measure of the correlation between two variables, is used to measure the correlation between two time series because it has several beneficial mathematical properties, for example, invariance with respect to scale and offset. Considering the need to analyze big time series data, we focus on scalable and distributed techniques for processing massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large-scale data processing. Our solution builds on (1) a fast method that computes the fast Fourier transform of the time series to calculate the correlation between two time series, (2) a lossless partition method that divides the time series into multiple subsequences and enables a parallel and correct computation of the join result, and (3) optimization techniques that avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.

KW - parallel

KW - partition

KW - spark

KW - time series join

UR - http://www.scopus.com/inward/record.url?scp=85077143723&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85077143723&partnerID=8YFLogxK

U2 - 10.1002/cpe.5622

DO - 10.1002/cpe.5622

M3 - Article

AN - SCOPUS:85077143723

SN - 1532-0626

VL - 32

JO - Concurrency Computation Practice and Experience

JF - Concurrency Computation Practice and Experience

IS - 9

M1 - e5622

ER -
