TY - GEN
T1 - Performance characterization, prediction, and optimization for heterogeneous systems with multi-level memory interference
AU - Lee, Shin Ying
AU - Wu, Carole-Jean
N1 - Funding Information:
To balance performance degradation in a heterogeneous system, we develop a light-weight and scalable performance degradation predictor (HeteroPDP) based on simple regression models. HeteroPDP can accurately select the target device in a heterogeneous system to optimize and balance the performance degradation among all co-located workloads. HeteroPDP is designed and implemented within the existing OpenCL framework and is evaluated on a real system consisting of an Intel Core i7-3770 CMP and an AMD FirePro GPU. Overall, HeteroPDP improves the performance of OpenCL applications by 3X by intelligently selecting the execution target between the host CMP and the GPU, whereas always offloading to the GPU yields a 2.5X speedup. This paper shows that a simple regression model approach, combined with the consideration of multi-level memory interference in HeteroPDP, can effectively improve the scheduling decisions for OpenCL applications, leading to higher application performance and system throughput. ACKNOWLEDGMENT: The authors would like to thank the paper shepherd, Dr. Sandeep Agrawal (Oracle), and the anonymous reviewers for their useful feedback. This work is supported in part by the National Science Foundation (under grants CCF #1618039 and CCF #1652132).
Publisher Copyright:
© 2017 IEEE.
PY - 2017/12/5
Y1 - 2017/12/5
N2 - Modern computer systems are accelerator-rich, equipped with many types of hardware accelerators to speed up computation. For example, graphics processing units (GPUs) are a type of accelerator widely employed to accelerate parallel workloads. To make good use of different accelerators, whether to improve execution time or to reduce total energy consumption, many scheduling algorithms have been proposed that select the optimal target device for an OpenCL kernel according to the kernel's individual characteristics. However, in a real computer system, many workloads are co-located on a single machine and are processed on different devices simultaneously. The CPU cores and accelerators may contend for shared resources, such as the host main memory and the shared last-level cache. Thus, scheduling an OpenCL kernel by considering only the characteristics of that kernel is not robust. To maximize system throughput, it is important to consider the execution behavior of all co-located applications when scheduling OpenCL kernel execution. In this paper, we provide a detailed characterization study demonstrating that scheduling an OpenCL kernel to run on different devices can have varying performance impacts on the kernel itself and on the other co-located applications due to memory interference. Based on the characterization results, we then develop a light-weight, scalable performance degradation predictor specifically for heterogeneous computer systems, called HeteroPDP. HeteroPDP aims to dynamically predict and balance the execution time slowdown of all co-located applications in a heterogeneous computation environment. Our real system evaluation results show that, compared with always running an OpenCL kernel on the host CPU, HeteroPDP achieves a 3X execution time speedup when an OpenCL kernel runs alone and improves system fairness from 24% to 65% when an OpenCL kernel is co-located with other applications.
UR - http://www.scopus.com/inward/record.url?scp=85046448503&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046448503&partnerID=8YFLogxK
U2 - 10.1109/IISWC.2017.8167755
DO - 10.1109/IISWC.2017.8167755
M3 - Conference contribution
AN - SCOPUS:85046448503
T3 - Proceedings of the 2017 IEEE International Symposium on Workload Characterization, IISWC 2017
SP - 43
EP - 53
BT - Proceedings of the 2017 IEEE International Symposium on Workload Characterization, IISWC 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE International Symposium on Workload Characterization, IISWC 2017
Y2 - 1 October 2017 through 3 October 2017
ER -