BigCache for big-data systems

Michel Angelo Roger, Yiqi Xu, Ming Zhao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Big-data systems are increasingly used in many disciplines for important tasks such as knowledge discovery and decision making by processing large volumes of data. Big-data systems rely on hard-disk drive (HDD) based storage to provide the necessary capacity. However, as big-data applications grow rapidly more diverse and demanding, HDD storage becomes insufficient to satisfy their performance requirements. Emerging solid-state drives (SSDs) promise great IO performance that big-data applications can exploit, but they still face serious limitations in capacity, cost, and endurance and therefore must be incorporated strategically into big-data systems. This paper presents BigCache, an SSD-based distributed caching layer for big-data systems. It is designed to integrate seamlessly with existing big-data systems and transparently accelerate IO for diverse big-data applications. The management of the distributed SSD caches in BigCache is coordinated with the job management of big-data systems in order to support cache-locality-driven job scheduling. BigCache is prototyped in Hadoop to provide caching upon HDFS for MapReduce applications. It is evaluated using typical MapReduce applications, and the results show that BigCache reduces the runtime of WordCount by 38% and the runtime of TeraSort by 52%. The results also show that BigCache achieves significant speedup even when only part of a benchmark's input is cached, owing to its support for partial-input caching and a replacement policy that recognizes application access patterns.
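The abstract names two mechanisms — cache-locality-driven job scheduling and an access-pattern-aware replacement policy — that can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: the names (`SSDBlockCache`, `schedule`) and the specific eviction heuristic (prefer evicting blocks read only once, as in a one-pass scan) are assumptions for demonstration only.

```python
from collections import OrderedDict

class SSDBlockCache:
    """Hypothetical per-node SSD block cache. Eviction prefers blocks that
    were read only once (scan-like access) over blocks re-read across jobs,
    loosely mirroring a replacement policy that recognizes access patterns."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()   # block_id -> access count, in LRU order

    def access(self, block_id):
        if block_id in self.blocks:
            self.blocks[block_id] += 1
            self.blocks.move_to_end(block_id)
            return True               # cache hit
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[block_id] = 1     # admit the block on a miss
        return False

    def _evict(self):
        # Evict the least-recently-used block among those read only once;
        # fall back to plain LRU if every resident block has been re-read.
        for block_id, count in self.blocks.items():
            if count == 1:
                del self.blocks[block_id]
                return
        self.blocks.popitem(last=False)

def schedule(task_blocks, node_caches):
    """Cache-locality-driven placement: assign each task to the node whose
    cache already holds the most of that task's input blocks."""
    assignment = {}
    for task, blocks in task_blocks.items():
        best = max(node_caches,
                   key=lambda n: sum(b in node_caches[n].blocks for b in blocks))
        assignment[task] = best
        for b in blocks:              # the chosen node reads (and caches) the input
            node_caches[best].access(b)
    return assignment
```

Under this sketch, a task whose input is already resident in one node's SSD cache is routed back to that node, so repeated jobs over the same dataset hit the cache even when only part of the input fits.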

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 189-194
Number of pages: 6
ISBN (Print): 9781479956654
DOI: 10.1109/BigData.2014.7004231
State: Published - Jan 7 2015
Externally published: Yes
Event: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington, United States
Duration: Oct 27 2014 - Oct 30 2014

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems

Cite this

Roger, M. A., Xu, Y., & Zhao, M. (2015). BigCache for big-data systems. In Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014 (pp. 189-194). [7004231] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2014.7004231
