Fault Awareness in the MPI 4.0 Session Model

Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023 (2023)

Abstract
MPI version 4.0 introduces new functionality such as the Session model but still lacks fault management mechanisms. Past efforts produced tools and MPI standard extensions for handling faults, including User Level Fault Mitigation (ULFM). These measures are effective against faults but do not fully support the new additions to the standard. In this paper, we combine the fault management capabilities of ULFM with the Session model introduced in version 4.0 of the standard. We focus on the communicator creation procedure, highlighting its critical points and proposing a method to circumvent them. The experimental campaign shows that the proposed solution does not significantly affect execution time or scalability while managing the occurrence of faults more effectively.
Keywords
HPC, MPI Sessions, ULFM, Fault Tolerance