BigCache for big-data systems

Michel Angelo Roger, Yiqi Xu, Ming Zhao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Big-data systems are increasingly used in many disciplines for important tasks such as knowledge discovery and decision making by processing large volumes of data. Big-data systems rely on hard-disk drive (HDD) based storage to provide the necessary capacity. However, as big-data applications grow rapidly more diverse and demanding, HDD storage becomes insufficient to satisfy their performance requirements. Emerging solid-state drives (SSDs) promise great IO performance that big-data applications can exploit, but they still face serious limitations in capacity, cost, and endurance and therefore must be incorporated strategically into big-data systems. This paper presents BigCache, an SSD-based distributed caching layer for big-data systems. It is designed to integrate seamlessly with existing big-data systems and transparently accelerate IO for diverse big-data applications. The management of the distributed SSD caches in BigCache is coordinated with the job management of big-data systems in order to support cache-locality-driven job scheduling. BigCache is prototyped in Hadoop to provide caching upon HDFS for MapReduce applications. It is evaluated using typical MapReduce applications, and the results show that BigCache reduces the runtime of WordCount by 38% and the runtime of TeraSort by 52%. The results also show that BigCache achieves significant speedup even when only part of a benchmark's input is cached, owing to its support for partial-input caching and a replacement policy that recognizes application access patterns.
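The abstract names two mechanisms — cache-locality-driven job scheduling and an access-pattern-aware replacement policy — that can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: the names (`SSDBlockCache`, `schedule`) and the specific eviction heuristic (prefer evicting blocks read only once, as in a one-pass scan) are assumptions for demonstration only.

```python
from collections import OrderedDict

class SSDBlockCache:
    """Hypothetical per-node SSD block cache. Eviction prefers blocks that
    were read only once (scan-like access) over blocks re-read across jobs,
    loosely mirroring a replacement policy that recognizes access patterns."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()   # block_id -> access count, in LRU order

    def access(self, block_id):
        if block_id in self.blocks:
            self.blocks[block_id] += 1
            self.blocks.move_to_end(block_id)
            return True               # cache hit
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[block_id] = 1     # admit the block on a miss
        return False

    def _evict(self):
        # Evict the least-recently-used block among those read only once;
        # fall back to plain LRU if every resident block has been re-read.
        for block_id, count in self.blocks.items():
            if count == 1:
                del self.blocks[block_id]
                return
        self.blocks.popitem(last=False)

def schedule(task_blocks, node_caches):
    """Cache-locality-driven placement: assign each task to the node whose
    cache already holds the most of that task's input blocks."""
    assignment = {}
    for task, blocks in task_blocks.items():
        best = max(node_caches,
                   key=lambda n: sum(b in node_caches[n].blocks for b in blocks))
        assignment[task] = best
        for b in blocks:              # the chosen node reads (and caches) the input
            node_caches[best].access(b)
    return assignment
```

Under this sketch, a task whose input is already resident in one node's SSD cache is routed back to that node, so repeated jobs over the same dataset hit the cache even when only part of the input fits.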

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 189-194
Number of pages: 6
ISBN (Print): 9781479956654
DOI: 10.1109/BigData.2014.7004231
State: Published - Jan 7 2015
Externally published: Yes
Event: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington, United States
Duration: Oct 27 2014 - Oct 30 2014

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems

Cite this

Roger, M. A., Xu, Y., & Zhao, M. (2015). BigCache for big-data systems. In Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014 (pp. 189-194). [7004231] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2014.7004231
