RegMutex: Inter-Warp GPU Register Time-Sharing.

ISCA(2018)

引用 50|浏览120
暂无评分
摘要
Registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads. Due to existence of a great number of concurrently executing threads, and the high cost of context switching mechanisms, contemporary GPUs are equipped with large register files. However, to avoid over-complicating the hardware, registers are statically assigned and exclusively dedicated to threads for the entire duration of the thread's lifetime. This decomposition takes into account the maximum number of live registers at any given point in the GPU binary although the points at which all the requested registers are used may constitute only a small fraction of the whole program. Therefore, a considerable portion of the register file remains under-utilized. In this paper, we propose a software-hardware co-mechanism named RegMutex (Register Mutual Exclusion) to share a subset of physical registers between warps during the GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While physical registers corresponding to the base register set are statically and exclusively assigned to the warp, the hardware time-shares the remaining physical registers across warps to provision their extended register set. Therefore, the GPU programs can sustain approximately the same performance with the lower number of registers hence yielding higher performance per dollar. For programs that require a large number of registers for execution, RegMutex will enable a higher number of concurrent warps to be resident in the hardware via sharing their register allocations with each other, leading to a higher device occupancy. Since some aspects of register sharing orchestration are being offloaded to the compiler, RegMutex introduces lower hardware complexity compared to existing approaches. Our experiments show that RegMutex improves the register utilization and reduces the number of execution cycles by up to 23% for kernels demanding a high number of registers.
更多
查看译文
关键词
compiler, GPGPU, GPU, microarchitecture, register, register file, RegMutex, time-multiplexing, time-sharing, warp
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要