Valkyrie: Leveraging Inter-TLB Locality to Enhance GPU Performance

PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 2020

Abstract
Programming on a GPU has been made considerably easier with the introduction of virtual memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with additional costs and overhead, the largest of which stems from address translation. Because a massive number of threads run concurrently on a GPU, the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB miss rates can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page can be accessed in temporal proximity by multiple cores, following similar memory access patterns. To exploit the inherent sharing present in GPU workloads, we propose Valkyrie, which integrates a cooperative TLB prefetching mechanism with an inter-L1-TLB probing scheme to efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads reveals that Valkyrie achieves an average speedup of 1.95x while adding modest hardware overhead.
Keywords
GPU, TLB Design, Virtual Memory, TLB Prefetching, TLB Probing