Analyzing the Impact of Different Real Number Formats on the Structural Reliability of TCUs in GPUs

2023 IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC), 2023

Abstract
Modern Graphics Processing Units (GPUs) accelerate tiled matrix multiplications by extensively using in-chip accelerators (Tensor Core Units, or TCUs). Unfortunately, cutting-edge semiconductor technologies are increasingly prone to defects, and the resulting hardware faults may affect TCUs processing massive amounts of data in classical floating-point formats, raising reliability concerns in the safety-critical and High-Performance Computing (HPC) domains. Nevertheless, a characterization of faulty TCUs supporting different arithmetic formats is still missing. This work quantitatively evaluates, for the first time, the effects of hardware faults arising in TCU structures under two different formats for real number representation (Floating-Point and Posit). For the experimental evaluation, we resort to an architectural description of a TCU core (PyOpenTCU) and perform 60 fault simulation campaigns, injecting 57,344 faults per campaign and requiring around 24 days of computation. The experimental results indicate a relation between the corrupted spatial areas in the output matrices and the TCU's scheduling policies. Moreover, the numeric analysis shows that hardware faults in TCUs in most cases affect up to 2 bits of the output results for both considered formats. The results also demonstrate that the Posit formats are up to one order of magnitude less affected by faults than the Floating-Point formats.
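To illustrate the kind of single-bit output corruption the abstract refers to, the sketch below flips one bit of an IEEE-754 binary32 encoding in Python. This is only a minimal illustration of bit-level fault effects on floating-point values; it is not the paper's PyOpenTCU fault-injection framework, and the function name `flip_bit` is an assumption of this sketch.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB of the mantissa, 31 = sign) of a binary32 value.

    A minimal sketch of the single-bit corruption a hardware fault in a
    TCU datapath could introduce; not the paper's injection framework.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit  # corrupt exactly one bit of the encoding
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    return corrupted

# Flipping the lowest mantissa bit of 1.0 barely changes the value...
print(flip_bit(1.0, 0))   # 1.0000001192092896  (1 + 2**-23)
# ...while flipping the top exponent bit drives it to infinity.
print(flip_bit(1.0, 30))  # inf
```

The two calls show why the *position* of a corrupted bit matters far more than the count: a low-order mantissa flip is a tiny relative error, while a high-order exponent flip destroys the result entirely.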
Keywords
Floating-point numbers, Graphics Processing Units (GPUs), Permanent Faults, Posit numbers, Real number arithmetic, Tensor Core Unit (TCU)