Decoupled Vector Runahead

56TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, MICRO 2023(2023)

引用 0|浏览5
暂无评分
摘要
We present Decoupled Vector Runahead (DVR), an in-core prefetching technique, executing separately to the main application thread, that exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect memory accesses. DVR dynamically infers loop bounds at run-time, recognizing striding loads, and vectorizing subsequent instructions that are part of an indirect chain. It proactively issues memory accesses for the resulting loads far into the future, even when the out-oforder core has not yet stalled, bringing their data into the L1 cache, and thus providing timely prefetches for the main thread. DVR can adjust the degree of vectorization at run-time, vectorize the same chain of indirect memory accesses across multiple invocations of an inner loop, and efficiently handle branch divergence along the vectorized chain. DVR runs as an on-demand, speculative, in-order, lightweight hardware subthread alongside the main thread within the core and incurs a minimal hardware overhead of only 1139 bytes. Relative to a large superscalar 5-wide out-of-order baseline and Vector Runahead - a recent microarchitectural technique to accelerate indirect memory accesses on out-of-order processors DVR delivers 2.4x and 2x higher performance, respectively, for a set of graph analytics, database, and HPC workloads.
更多
查看译文
关键词
CPU microarchitecture,prefetching,runahead,speculative vectorization,graph processing
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要