Robust Class Parallelism: Error Resilient Parallel Inference with Low Communication Cost

ACSSC (2020)

Abstract
Model parallelism is a standard paradigm for decoupling a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism, down to a single floating-point number per iteration. However, traditional fault-tolerance schemes, when applied to class parallelism, require storing the entire model on hard disk and are therefore ill-suited to soft and frequent system noise such as stragglers (temporarily slow worker machines). In this paper, we propose an erasure-coding based redundant computing technique, called robust class parallelism, to improve the error resilience of model parallelism. We show that by introducing a slight computation overhead at each machine, we obtain robustness to soft system noise while maintaining the low communication overhead of class parallelism. More importantly, we show that robust class parallelism maintains state-of-the-art performance on standard classification tasks.
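To make the idea concrete, the sketch below simulates class parallelism with redundant class assignments: each worker scores only its own subset of classes and returns a single (class, score) pair, and because every class is assigned to more than one worker, the aggregator can still produce the correct prediction when one worker straggles. This is a minimal illustration using simple replicated assignments and toy linear scoring; the worker counts, replication factor, and function names are assumptions for the example and do not reproduce the paper's actual erasure-coding construction.

```python
import numpy as np

# Hypothetical illustration of redundant class parallelism.  Each "worker"
# scores only its assigned subset of classes and returns a constant-size
# message (its best class id and score).  Redundant assignments let the
# aggregator tolerate a straggler.  Replication is used here as a stand-in
# for the erasure-coding scheme described in the paper.

NUM_CLASSES = 10
NUM_WORKERS = 5
REDUNDANCY = 2  # each class is scored by 2 workers

rng = np.random.default_rng(0)

# Toy "sub-net": each worker owns the weight rows for its classes.
weights = rng.normal(size=(NUM_CLASSES, 16))

def assign_classes(num_classes, num_workers, redundancy):
    """Round-robin assignment so every class appears on `redundancy` workers."""
    assignment = [[] for _ in range(num_workers)]
    for c in range(num_classes):
        for r in range(redundancy):
            assignment[(c + r) % num_workers].append(c)
    return assignment

def worker_score(worker_classes, features):
    """Compute logits only for this worker's classes; return (best_class, best_score)."""
    logits = weights[worker_classes] @ features
    best = int(np.argmax(logits))
    return worker_classes[best], float(logits[best])

def aggregate(messages):
    """Pick the argmax over the single-score messages that actually arrived."""
    return max(messages, key=lambda m: m[1])[0]

assignment = assign_classes(NUM_CLASSES, NUM_WORKERS, REDUNDANCY)
features = rng.normal(size=16)

# Simulate a straggler: worker 3 never responds, yet every class is still
# covered by some surviving worker, so the prediction is unaffected.
messages = [worker_score(assignment[w], features)
            for w in range(NUM_WORKERS) if w != 3]
print("predicted class:", aggregate(messages))
```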
Keywords
Distributed computing, deep learning, system robustness, computation redundancy