On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs

ACM Transactions on Mathematical Software (2022)

Abstract
Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations are adopting a lookup table when adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.
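The lookup-table idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the global matrix is stored in CSR format and that the table holds, for every cell and every local entry (i, j), the position of the matching global entry in the CSR value array. The abstract does not specify the storage format or table layout, so both are assumptions, and the function names `build_csr_pattern`, `build_lookup_table` and `assemble` are hypothetical.

```python
def build_csr_pattern(cells, n_dofs):
    """Build CSR row pointers and sorted column indices from cell-to-dof lists."""
    rows = [set() for _ in range(n_dofs)]
    for dofs in cells:
        for i in dofs:
            rows[i].update(dofs)
    row_ptr, col_idx = [0], []
    for cols in rows:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def build_lookup_table(cells, row_ptr, col_idx):
    """For each cell, list the CSR value positions of its local entries (row-major)."""
    table = []
    for dofs in cells:
        positions = []
        for i in dofs:
            start, end = row_ptr[i], row_ptr[i + 1]
            # The column search is paid once here, not on every assembly pass.
            where = {col: start + k for k, col in enumerate(col_idx[start:end])}
            positions.extend(where[j] for j in dofs)
        table.append(positions)
    return table


def assemble(cells, element_matrices, table, nnz):
    """Add element matrices into the CSR value array using the lookup table."""
    values = [0.0] * nnz
    for c, element_matrix in enumerate(element_matrices):
        pos = table[c]
        n = len(cells[c])
        for i in range(n):
            for j in range(n):
                # Direct indexed update: no search through the column indices.
                values[pos[i * n + j]] += element_matrix[i][j]
    return values


# Hypothetical example: two 1D line elements sharing node 1.
cells = [(0, 1), (1, 2)]
row_ptr, col_idx = build_csr_pattern(cells, n_dofs=3)
table = build_lookup_table(cells, row_ptr, col_idx)
stiffness = [[1.0, -1.0], [-1.0, 1.0]]
values = assemble(cells, [stiffness, stiffness], table, nnz=len(col_idx))
```

In this sketch the table costs one extra integer per local matrix entry, in exchange for removing the per-entry column search from the assembly loop. Whether that trade pays off depends on how often the matrix is reassembled on the same mesh.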
Keywords
Finite element methods, assembly, multi-core, Intel Xeon, AMD Epyc, Cavium TX2