TY - GEN
T1 - ID-cache
T2 - 2016 IEEE International Symposium on Workload Characterization, IISWC 2016
AU - Arunkumar, Akhil
AU - Lee, Shin Ying
AU - Wu, Carole-Jean
N1 - Funding Information:
The authors would like to thank the anonymous reviewers for their insightful feedback. This work is supported in part by the National Science Foundation (Grant #CCF-1618039) and by Science Foundation Arizona under the Bisgrove Early Career Scholarship.
Publisher Copyright:
© 2016 IEEE.
PY - 2016/10/3
Y1 - 2016/10/3
N2 - Modern graphics processing units (GPUs) are able not only to perform graphics rendering but also to perform general-purpose parallel computations (GPGPU). It has been shown that the GPU L1 data cache and the on-chip interconnect bandwidth are important sources of performance bottlenecks and inefficiencies in GPGPUs. Through this work, we aim to understand the sources of these inefficiencies and possible opportunities for more efficient cache and interconnect bandwidth management on GPUs. We do so by studying the predictability of the reuse behavior and spatial utilization of cache lines using program-level information, such as the instruction PC, and runtime behavior, such as the extent of memory divergence. Through our characterization results, we demonstrate that a) the PC and memory divergence can be used to efficiently bypass zero-reuse cache lines from the cache; and b) memory divergence information can further be used to dynamically insert cache lines of varying size granularities based on their spatial utilization. Finally, based on the insights derived through our characterization, we design a simple Instruction and memory Divergence cache management method that achieves an average performance improvement of 71% for a wide variety of cache- and interconnect-sensitive applications.
AB - Modern graphics processing units (GPUs) are able not only to perform graphics rendering but also to perform general-purpose parallel computations (GPGPU). It has been shown that the GPU L1 data cache and the on-chip interconnect bandwidth are important sources of performance bottlenecks and inefficiencies in GPGPUs. Through this work, we aim to understand the sources of these inefficiencies and possible opportunities for more efficient cache and interconnect bandwidth management on GPUs. We do so by studying the predictability of the reuse behavior and spatial utilization of cache lines using program-level information, such as the instruction PC, and runtime behavior, such as the extent of memory divergence. Through our characterization results, we demonstrate that a) the PC and memory divergence can be used to efficiently bypass zero-reuse cache lines from the cache; and b) memory divergence information can further be used to dynamically insert cache lines of varying size granularities based on their spatial utilization. Finally, based on the insights derived through our characterization, we design a simple Instruction and memory Divergence cache management method that achieves an average performance improvement of 71% for a wide variety of cache- and interconnect-sensitive applications.
UR - http://www.scopus.com/inward/record.url?scp=84994779312&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84994779312&partnerID=8YFLogxK
U2 - 10.1109/IISWC.2016.7581276
DO - 10.1109/IISWC.2016.7581276
M3 - Conference contribution
AN - SCOPUS:84994779312
T3 - Proceedings of the 2016 IEEE International Symposium on Workload Characterization, IISWC 2016
SP - 158
EP - 167
BT - Proceedings of the 2016 IEEE International Symposium on Workload Characterization, IISWC 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 September 2016 through 27 September 2016
ER -