CAWS: Criticality-aware warp scheduling for GPGPU workloads

Shin Ying Lee, Carole-Jean Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

45 Citations (Scopus)

Abstract

Fast context switching and massive multithreading are the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip multiprocessors for parallel workloads. One of the main benefits of such an architecture is its latency-hiding capability. However, the efficacy of the GPU's latency hiding varies significantly across GPGPU applications. To investigate this, the paper first proposes a new algorithm that profiles the execution behavior of GPGPU applications. We characterize latencies caused by various pipeline hazards, memory accesses, synchronization primitives, and the warp scheduler. Our results show that the current round-robin warp scheduler overlaps latency stalls with the execution of other available warps effectively for only a few GPGPU applications. For the other applications, there are excessive latency stalls that the scheduler cannot hide effectively. With this latency characterization, we observe a significant execution time disparity among warps within the same thread block, which causes sub-optimal performance; we call this the warp criticality problem. To tackle it, we design a family of criticality-aware warp scheduling (CAWS) policies that schedule the critical warp(s) more frequently than other warps. Our results on breadth-first search, B+tree search, two-point angular correlation function, and K-means clustering show that, with oracle knowledge of warp criticality, our best-performing scheduling policy can improve GPGPU applications' performance by 17% on average. With our designed criticality predictor, the various scheduling policies can improve performance by 10-21% on breadth-first search. To our knowledge, this is the first paper to characterize warp criticality and explore different criticality-aware warp scheduling policies for GPGPU workloads.
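
The warp criticality problem described in the abstract can be illustrated with a toy model. The sketch below is not the paper's profiler, predictor, or any of its CAWS policies. It assumes a single thread block, one instruction issued per cycle, a fixed stall after every issue, and oracle knowledge of each warp's remaining instruction count. The names `simulate`, `round_robin`, and `critical_first` are hypothetical and chosen only for this illustration.

```python
from typing import Callable, List

# A picker receives the ready warp indices, the remaining instruction
# counts of all warps, and the index of the last issued warp.
Picker = Callable[[List[int], List[int], int], int]


def simulate(work: List[int], stall: int, pick: Picker) -> int:
    """Return the number of cycles until every warp in the block is done.

    Toy assumptions (not from the paper): one issue slot per cycle, and a
    warp that issues at cycle c is ready again at cycle c + 1 + stall.
    """
    remaining = list(work)
    ready_at = [0] * len(work)
    cycle = 0
    last = -1  # no warp has issued yet
    while any(r > 0 for r in remaining):
        ready = [w for w in range(len(work))
                 if remaining[w] > 0 and ready_at[w] <= cycle]
        if ready:
            w = pick(ready, remaining, last)
            remaining[w] -= 1
            ready_at[w] = cycle + 1 + stall
            last = w
        cycle += 1  # an empty ready list means the issue slot is wasted
    return cycle


def round_robin(ready: List[int], remaining: List[int], last: int) -> int:
    """Pick the next ready warp after `last` in circular order."""
    n = len(remaining)
    return min(ready, key=lambda w: (w - last - 1) % n)


def critical_first(ready: List[int], remaining: List[int], last: int) -> int:
    """Pick the ready warp with the most remaining work (oracle criticality)."""
    return max(ready, key=lambda w: remaining[w])


if __name__ == "__main__":
    block = [12, 4, 4, 4]  # warp 0 is the critical (slowest) warp
    print("round-robin   :", simulate(block, stall=1, pick=round_robin))
    print("critical-first:", simulate(block, stall=1, pick=critical_first))
```

In this toy block, round-robin gives the critical warp only one issue slot in four while the short warps are alive, so once they finish the critical warp runs alone and its stalls can no longer be overlapped. Prioritizing the critical warp lets the short warps fill its stall cycles instead, and the block finishes sooner.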

Original language: English (US)
Title of host publication: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 175-186
Number of pages: 12
ISBN (Print): 9781450328098
DOIs: https://doi.org/10.1145/2628071.2628107
State: Published - 2014
Event: 23rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2014 - Edmonton, AB, Canada
Duration: Aug 24 2014 to Aug 27 2014

Other

Other: 23rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2014
Country: Canada
City: Edmonton, AB
Period: 8/24/14 to 8/27/14

Fingerprint

GPGPU
Criticality
Workload
Latency
Scheduling
Scheduling Policy
Scheduler
Breadth-first Search
B-tree
Chip multiprocessors
Multithreading
Hazards
Synchronization
K-means Clustering
Pipelines
Hazard
Thread
Execution Time
Data storage equipment
Overlapping

Keywords

  • gpgpu
  • gpu performance characterization
  • warp/wavefront scheduling

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Theoretical Computer Science

Cite this

Lee, S. Y., & Wu, C-J. (2014). CAWS: Criticality-aware warp scheduling for GPGPU workloads. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT (pp. 175-186). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/2628071.2628107
