Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements

High Performance Computing, ISC High Performance 2022 (2022)

Abstract
Frameworks for Distributed Deep Learning (DDL) have become a popular way to distribute training by adding only a few lines of code to a single-node script. From a High-Performance Computing (HPC) perspective, the profiling tools traditionally used by Machine Learning (ML) researchers fail to expose details about distributed training performance, such as synchronization points, communication and computation time, and device usage throughout training. Moreover, HPC and ML measurements are usually analyzed independently. We present a methodology for the performance analysis of DDL frameworks that combines HPC and ML tools, applying intrusive and non-intrusive tracing to enrich the findings of a strong-scaling study on three clusters with different GPU models. We selected two modern DDL frameworks: Horovod and Tarantella. Using spatial and temporal analysis, we identify bottlenecks in the frameworks, such as a long initialization time for Horovod and the non-distribution of data during the testing phase for Tarantella. We extract performance measurements using temporal aggregation over the training phases, which can help DDL framework developers improve their tools. Horovod presented the best scaling efficiency for 4 GPUs or more, reaching up to 84.6% for 4 GPUs and a large batch size, while Tarantella achieved 54.7% for the same case. Using our temporal aggregation approach, we identified that this result originates from Horovod processing an epoch faster than Tarantella.
Keywords
Distributed Deep Learning, Performance analysis, HPC
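The scaling efficiency percentages quoted in the abstract come from a strong-scaling study, where the problem size stays fixed while the number of GPUs grows. The abstract does not state the exact formula the authors used, so the sketch below assumes the conventional definition: measured speedup divided by ideal speedup. The function name, its parameters, and the timings in the usage lines are hypothetical and are not taken from the paper.

```python
def strong_scaling_efficiency(t_base, t_n, n_devices, base_devices=1):
    """Return strong-scaling efficiency as a fraction in (0, 1] for ideal or
    sub-ideal scaling.

    Assumed (conventional) definition, not confirmed by the abstract:
        efficiency = (t_base / t_n) / (n_devices / base_devices)

    t_base       -- wall time of the baseline run on `base_devices` devices
    t_n          -- wall time of the same workload on `n_devices` devices
    """
    if t_base <= 0 or t_n <= 0:
        raise ValueError("run times must be positive")
    if base_devices < 1 or n_devices < base_devices:
        raise ValueError("need n_devices >= base_devices >= 1")

    speedup = t_base / t_n                  # measured speedup over the baseline
    ideal_speedup = n_devices / base_devices  # perfect linear scaling
    return speedup / ideal_speedup


# Hypothetical timings: 120 s on 1 GPU, 40 s on 4 GPUs.
efficiency = strong_scaling_efficiency(120.0, 40.0, n_devices=4)
print(f"{efficiency:.1%}")
```

Under this definition, an efficiency of 100% means the run time shrinks in exact proportion to the number of GPUs added; time spent in initialization, synchronization, or non-distributed phases pulls the value below that.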