Abstract
Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose Pangea, a single system that manages all data, both intermediate and long-lived, together with its buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
Original language | English (US) |
---|---|
Pages (from-to) | 1049-1073 |
Number of pages | 25 |
Journal | VLDB Journal |
Volume | 29 |
Issue number | 5 |
State | Published - Sep 1 2020 |
Keywords
- Big Data analytics
- Distributed system
- Heterogeneous replication
- Monolithic storage
ASJC Scopus subject areas
- Information Systems
- Hardware and Architecture