Efficient Warp Execution In Presence Of Divergence With Collaborative Context Collection
MICRO (2015)
Abstract
The GPU's SIMD architecture is a double-edged sword when confronting parallel tasks with control flow divergence. On the one hand, it provides a high-performance yet power-efficient platform to accelerate applications via massive parallelism; on the other hand, irregularities induce inefficiencies due to the warp's lockstep traversal of all diverging execution paths. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases warp execution efficiency when faced with thread divergence incurred either by different intra-warp task assignment or by intra-warp load imbalance. CCC collects the relevant registers of divergent threads in a warp-specific stack allocated in the fast shared memory, and restores them only when perfect utilization of warp lanes becomes feasible. We propose code transformations to enable applicability of CCC to a variety of program segments with thread divergence. We also introduce optimizations to reduce the cost of CCC and to avoid device occupancy limitation or memory divergence. We have developed a framework that automates application of CCC to CUDA-generated intermediate PTX code. We evaluated CCC on real-world applications and multiple scenarios using synthetic programs. CCC improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69x (maximum 3.08x).
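The collect-and-restore idea in the abstract can be illustrated with a small host-side simulation. This is a minimal sketch under stated assumptions, not the authors' compiler transformation: it only counts lane utilization, it models the warp-specific shared-memory stack as a Python list, and the names `run_baseline`, `run_ccc`, `efficiency`, and the synthetic workload are hypothetical, introduced here for illustration only.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def run_baseline(needs_path):
    """Lockstep model: the warp enters the divergent path whenever
    at least one lane needs it, and idle lanes are wasted.

    needs_path[lane][i] is True if `lane` takes the divergent path
    in loop iteration `i`. Returns (path_executions, busy_lane_slots).
    """
    executions, busy = 0, 0
    for i in range(len(needs_path[0])):
        active = sum(1 for lane in range(WARP_SIZE) if needs_path[lane][i])
        if active > 0:
            executions += 1
            busy += active
    return executions, busy


def run_ccc(needs_path):
    """Simplified collect-and-restore model (an assumption-laden sketch).

    Lanes that would diverge push their context onto a per-warp stack
    instead of executing the path. Once the stack holds enough contexts
    to fill every lane, the warp restores WARP_SIZE of them and executes
    the path at full utilization. Leftovers are flushed at the end.
    """
    stack = []  # stands in for the warp-specific stack in shared memory
    executions, busy = 0, 0
    for i in range(len(needs_path[0])):
        for lane in range(WARP_SIZE):
            if needs_path[lane][i]:
                stack.append((lane, i))  # saved "registers" of the thread
        while len(stack) >= WARP_SIZE:
            del stack[-WARP_SIZE:]  # restore one context per lane
            executions += 1
            busy += WARP_SIZE
    if stack:  # final partial execution for the remaining contexts
        executions += 1
        busy += len(stack)
        stack.clear()
    return executions, busy


def efficiency(executions, busy):
    """Fraction of lane slots doing useful work on the divergent path."""
    return busy / (executions * WARP_SIZE) if executions else 1.0


if __name__ == "__main__":
    # Synthetic workload: in every iteration, one lane in four diverges.
    workload = [[(lane + i) % 4 == 0 for i in range(8)]
                for lane in range(WARP_SIZE)]
    print(efficiency(*run_baseline(workload)))  # 0.25
    print(efficiency(*run_ccc(workload)))       # 1.0
```

In this synthetic workload both variants perform the same 64 units of divergent work, but the baseline spends 8 warp-wide passes on it while the collected variant spends 2. The sketch ignores the cost of pushing and popping contexts, which the abstract notes the real technique must optimize.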
Keywords
GPU, GPGPU, warp, SIMD, SIMT, warp execution, divergence, CCC, context stack