A detailed GPU cache model based on reuse distance theory

High Performance Computer Architecture (2014)

Abstract
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding-registers, and (5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
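To make the sequential baseline concrete, the sketch below illustrates classic reuse distance theory as the abstract describes it, before the paper's GPU extensions: for each access in a memory trace, the reuse distance is the number of distinct cache lines touched since the previous access to the same line, and under a fully associative LRU cache of C lines an access hits iff its distance is below C. This is an illustrative C++ sketch (the function names and the O(N·M) stack walk are ours, not the paper's implementation):

```cpp
#include <cstdint>
#include <list>
#include <vector>

// Reuse distance of each access in a trace of cache-line addresses.
// Returns -1 for a cold (first-time) access, i.e. infinite distance.
std::vector<long> reuse_distances(const std::vector<uint64_t>& lines) {
    std::list<uint64_t> stack;               // LRU stack, most recent at front
    std::vector<long> dist;
    for (uint64_t line : lines) {
        long d = 0;
        auto it = stack.begin();
        for (; it != stack.end(); ++it, ++d)  // count distinct lines above it
            if (*it == line) break;
        if (it == stack.end()) {
            dist.push_back(-1);               // never seen before: cold miss
        } else {
            dist.push_back(d);
            stack.erase(it);                  // move line to the top
        }
        stack.push_front(line);
    }
    return dist;
}

// Miss rate of a fully associative LRU cache holding capacity_lines lines:
// an access misses iff its reuse distance is infinite or >= capacity.
double miss_rate(const std::vector<long>& dist, long capacity_lines) {
    long misses = 0;
    for (long d : dist)
        if (d < 0 || d >= capacity_lines) ++misses;
    return static_cast<double>(misses) / static_cast<double>(dist.size());
}
```

For the trace A B A C B A, the distances are ∞ ∞ 1 ∞ 2 2, so a 3-line cache incurs only the three cold misses. The paper's contribution is extending exactly this hit/miss criterion to the GPU setting, where warp scheduling, associativity, MSHRs, and divergence perturb the sequential access order.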
Keywords
C++ language, benchmark testing, cache storage, graphics processing units, multi-threading, storage allocation, GPU cache model, Ocelot GPU emulator, Parboil benchmark suites, PolyBench/GPU benchmark suites, active thread hierarchy, cache associativity, cache behaviour prediction, cache configurations, cache locality optimisation, cache miss rates, conditional non-uniform latencies, fine-grained multithreading, mean absolute error, memory address list extraction, miss-status holding-registers, parallel execution model, reuse distance theory, sequential processors, stack distance, thread hierarchy, threadblock hierarchy, warp divergence, warp hierarchy