Parallel time series join using Spark

Chuitian Rong, Lili Chen, Yasin N. Silva

Research output: Contribution to journal › Article › peer-review

Abstract

A time series is a sequence of data points in successive temporal order. Time series data is produced in many application scenarios, and the techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. As the Pearson correlation coefficient, a measure of the correlation between two variables, has multiple beneficial mathematical properties, for example, the fact that it is invariant with respect to scale and offset, it is used to measure the correlation between two time series. Considering the need to analyze big time series data, we focus on the study of scalable and distributed techniques to process massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large-scale data processing. Our solution builds on (1) a fast method that uses the fast Fourier transform on the time series to calculate the correlation between two time series, (2) a lossless partition method that divides the time series into multiple subsequences and enables a parallel and correct computation of the join result, and (3) optimization techniques that avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.
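The abstract does not give the algorithmic details, but its two core ideas, FFT-accelerated correlation and lossless partitioning, can be illustrated. The first sketch below computes the Pearson correlation between a query subsequence and every same-length window of a longer series, using the FFT for the sliding dot products; the function names and normalization details are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sliding_dot_products(q, t):
    """Dot products of q with every length-len(q) window of t, via FFT.

    Correlating q against t equals convolving t with the reversed q,
    which becomes an element-wise product in the frequency domain.
    """
    m, n = len(q), len(t)
    fft_len = n + m - 1                      # long enough for a linear convolution
    spec = np.fft.rfft(t, fft_len) * np.fft.rfft(q[::-1], fft_len)
    conv = np.fft.irfft(spec, fft_len)
    # The dot product of q with the window starting at i sits at index i + m - 1.
    return conv[m - 1 : n]

def sliding_pearson(q, t):
    """Pearson correlation of q with each length-len(q) window of t."""
    m = len(q)
    dots = sliding_dot_products(q, t)
    # Rolling window means and standard deviations of t via cumulative sums.
    csum = np.cumsum(np.insert(t, 0, 0.0))
    csum2 = np.cumsum(np.insert(t * t, 0, 0.0))
    win_mean = (csum[m:] - csum[:-m]) / m
    win_var = (csum2[m:] - csum2[:-m]) / m - win_mean ** 2
    win_std = np.sqrt(np.maximum(win_var, 1e-12))
    # Pearson r = (E[xy] - E[x] E[y]) / (std_x * std_y), for all windows at once.
    return (dots / m - win_mean * q.mean()) / (win_std * q.std())
```

The second sketch shows, again only as an assumption about how a lossless partitioning could look, how the series can be split into chunks that overlap by one window length minus one point, so that no window crossing a chunk boundary is lost, and how the per-chunk correlation work can then be distributed with Spark. It reuses `sliding_pearson` from the sketch above; chunk sizes and data are placeholders.

```python
import numpy as np
from pyspark.sql import SparkSession

def overlapping_chunks(series, chunk_len, window_len):
    """Split series into chunks overlapping by window_len - 1 points.

    Every window of length window_len is then fully contained in exactly
    one chunk, so no candidate window is lost at partition boundaries.
    """
    step = chunk_len - (window_len - 1)
    return [(start, series[start : start + chunk_len])
            for start in range(0, len(series) - window_len + 1, step)]

spark = SparkSession.builder.appName("ts-join-sketch").getOrCreate()
sc = spark.sparkContext

t = np.random.randn(1_000_000)   # the long time series
q = np.random.randn(128)         # a query subsequence

# Each Spark task handles one chunk; the chunk offset turns local
# window indices into positions in the original series.
pairs = (sc.parallelize(overlapping_chunks(t, chunk_len=100_000, window_len=len(q)))
           .flatMap(lambda c: [(c[0] + i, r)
                               for i, r in enumerate(sliding_pearson(q, c[1]))]))

top_matches = pairs.takeOrdered(10, key=lambda kv: -kv[1])   # most correlated windows
```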

Original language: English (US)
Article number: e5622
Journal: Concurrency and Computation: Practice and Experience
Volume: 32
Issue number: 9
DOIs
State: Published - May 10, 2020

Keywords

  • parallel
  • partition
  • spark
  • time series join

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics
