Pangea: Monolithic distributed storage for data analytics

Jia Zou; Arun Iyengar; Chris Jermaine

doi:10.14778/3311880.3311885

Research output: Contribution to journal › Conference article › peer-review

10 Scopus citations

Abstract

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and nonshared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper we propose a single system called Pangea that can manage all data (both intermediate and long-lived data, along with their buffering/caching, data placement optimization, and failure recovery) in one monolithic distributed storage system, without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

Original language	English (US)
Pages (from-to)	681-694
Number of pages	14
Journal	Proceedings of the VLDB Endowment
Volume	12
Issue number	6
DOIs	https://doi.org/10.14778/3311880.3311885
State	Published - 2018
Externally published	Yes
Event	45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States Duration: Aug 26 2019 → Aug 30 2019

ASJC Scopus subject areas

Computer Science (miscellaneous)
General Computer Science

Cite this

@article{47ad21178b5942608fca3975c18fe514,

title = "Pangea: Monolithic distributed storage for data analytics",

abstract = "Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and nonshared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper we propose a single system called Pangea that can manage all data (both intermediate and long-lived data, along with their buffering/caching, data placement optimization, and failure recovery) in one monolithic distributed storage system, without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.",

author = "Jia Zou and Arun Iyengar and Chris Jermaine",

note = "Publisher Copyright: {\textcopyright} 2019, Association for Computing Machinery.; 45th International Conference on Very Large Data Bases, VLDB 2019 ; Conference date: 26-08-2019 Through 30-08-2019",

year = "2018",

doi = "10.14778/3311880.3311885",

language = "English (US)",

volume = "12",

pages = "681--694",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "6",

}

TY - JOUR

T1 - Pangea

T2 - 45th International Conference on Very Large Data Bases, VLDB 2019

AU - Zou, Jia

AU - Iyengar, Arun

AU - Jermaine, Chris

PY - 2018

Y1 - 2018

AB - Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and nonshared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper we propose a single system called Pangea that can manage all data (both intermediate and long-lived data, along with their buffering/caching, data placement optimization, and failure recovery) in one monolithic distributed storage system, without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

UR - http://www.scopus.com/inward/record.url?scp=85069487374&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069487374&partnerID=8YFLogxK

U2 - 10.14778/3311880.3311885

DO - 10.14778/3311880.3311885

M3 - Conference article

AN - SCOPUS:85069487374

SN - 2150-8097

VL - 12

SP - 681

EP - 694

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 6

Y2 - 26 August 2019 through 30 August 2019

ER -
