Pangea: Monolithic distributed storage for data analytics

Jia Zou, Arun Iyengar, Chris Jermaine

Research output: Contribution to journal (Conference article)

Abstract

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non-shared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper we propose a single system called Pangea that manages all data (both intermediate and long-lived), together with its buffering and caching, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.

Original language: English (US)
Pages (from-to): 681-694
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 12
Issue number: 6
State: Published - Jan 1 2018
Externally published: Yes
Event: 45th International Conference on Very Large Data Bases, VLDB 2019 - Los Angeles, United States
Duration: Aug 26 2019 - Aug 30 2019

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)