Open Access
Improving GPU performance through instruction redistribution and diversification
Author(s) - Xiang Gong
Publication year - 2018
Language(s) - English
Resource type - Dissertations/theses
DOI - 10.17760/d20294182
Subject(s) - parallel computing, computer science, thread (computing), cuda, parallelism (grammar), task parallelism, general purpose computing on graphics processing units, computer architecture, operating system, graphics
Abstract of the Dissertation

Improving GPU Performance through Instruction Redistribution and Diversification
by Xiang Gong
Doctor of Philosophy in Computer Engineering
Northeastern University, August 2018
Professor David Kaeli, Advisor

As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, often leaving GPU resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks.

Under a SIMT execution model, GPU applications frequently send bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units, such as the wavefront scheduler, are used to manage such requests. Compute-bound threads can be scheduled to utilize compute resources while memory requests are serviced. However, the scheduler struggles when memory operations dominate execution, because it cannot effectively hide their long latency.

The degree of instruction diversity present in a single application may also be insufficient to fully utilize the resources on a GPU. GPU workloads tend to stress a particular hardware resource while leaving others under-utilized. Coarse-grained hardware resource sharing techniques, such as concurrent kernel execution, fail to guarantee that GPU hardware resources are truly shared by different kernels. Introducing additional kernels that utilize similar resources may add contention to the system, especially if the candidate kernels fail to use hardware resources collaboratively.

Most previous studies treated achieving GPU peak performance as a hardware issue. Extensive efforts have been made to remove hardware bottlenecks to improve efficiency. In this thesis, we argue that software plays an equal, if not more important, role. We
