Long-Term Reliability Management For Multitasking GPGPUs

2019 16th International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD) (2019)

Abstract
This paper proposes long-term reliability management for spatial multitasking GPU architectures. Specifically, we focus on electromigration (EM)-induced long-term failure of the GPU's power delivery network. We develop a distributed power delivery network model at functional-unit granularity and use it for our EM analysis of GPU architectures. We use a recently proposed physics-based EM reliability model and treat the EM-induced time-to-failure at the GPU system level as a reliability resource. For GPU scheduling, we focus mainly on spatial multitasking, which allows GPU computing resources to be partitioned among multiple applications. We find that the existing reliability-agnostic thread block scheduler for spatial multitasking achieves high GPU utilization but poor reliability. We develop and implement a long-term reliability-aware thread block scheduler in GPGPU-Sim and compare it against the existing reliability-agnostic scheduler. We evaluate several use cases of spatial multitasking and find that our proposed scheduler achieves up to 30% improvement in long-term reliability.
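The idea of treating time-to-failure as a resource that a scheduler spends can be illustrated with a minimal sketch. It is not the scheduler or the EM model described in the abstract: it substitutes the classic Black's equation for the physics-based model, assumes damage accumulates linearly with stress time, and tracks wear per compute unit rather than per power-grid wire. The names `black_mttf` and `ReliabilityAwareScheduler`, and all parameter values, are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(current_density, temperature_k, a=1.0, n=2.0, ea_ev=0.9):
    """Black's equation: MTTF = A * j^(-n) * exp(Ea / (k * T)).

    Stand-in for the physics-based EM model; a, n, and ea_ev are
    illustrative values, not fitted parameters.
    """
    return a * current_density ** (-n) * math.exp(
        ea_ev / (BOLTZMANN_EV * temperature_k)
    )


class ReliabilityAwareScheduler:
    """Greedy sketch: place each thread block on the least-worn compute unit."""

    def __init__(self, num_units):
        # Fraction of EM lifetime consumed per unit (assumed linear accumulation).
        self.damage = [0.0] * num_units

    def pick_unit(self):
        # Choose the unit with the most remaining lifetime (lowest damage).
        return min(range(len(self.damage)), key=self.damage.__getitem__)

    def run_block(self, unit, duration, current_density, temperature_k):
        # Running for `duration` at this stress level consumes duration / MTTF.
        self.damage[unit] += duration / black_mttf(current_density, temperature_k)

    def schedule(self, blocks):
        """blocks: iterable of (duration, current_density, temperature_k)."""
        placement = []
        for duration, current_density, temperature_k in blocks:
            unit = self.pick_unit()
            self.run_block(unit, duration, current_density, temperature_k)
            placement.append(unit)
        return placement


if __name__ == "__main__":
    sched = ReliabilityAwareScheduler(num_units=2)
    blocks = [(1.0, 2.0, 350.0), (1.0, 1.0, 350.0), (1.0, 1.0, 350.0)]
    print(sched.schedule(blocks))
```

In this sketch a block with higher current density wears its unit faster, so later blocks are steered toward the unit with more remaining lifetime. A reliability-agnostic scheduler would ignore `damage` and balance only on occupancy.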
Keywords
long-term reliability management, multitasking GPGPUs, spatial multitasking GPU architectures, electromigration-induced long-term failure, GPU's power delivery network, distributed power delivery network model, EM-induced time-to-failure, GPU system level, reliability resource, GPU scheduling, high GPU utilization, long-term reliability-aware thread block scheduler, reliability-agnostic thread block scheduler, physics-based EM reliability model