Compression by the signs: distributed learning is a two-way street

ICLR (2018)

Cited 23 | Views 16
Abstract
Training large neural networks requires distributing learning over multiple workers. The rate-limiting step is often in sending gradients from the workers to the parameter server and back again. We present SIGNSGD with majority vote: the first gradient compression scheme to achieve 1-bit compression of worker-server communication in both directions with non-vacuous theoretical guarantees. To achieve this, we build an extensive theory of sign-based optimisation, which is also relevant to understanding adaptive gradient methods like ADAM and RMSPROP. We prove that SIGNSGD can get the best of both worlds: compressed gradients and an SGD-level convergence rate. SIGNSGD can exploit mismatches between ℓ1 and ℓ2 geometry: when noise and curvature are much sparser than the gradients, SIGNSGD is expected to converge at the same rate as, or faster than, full-precision SGD. Measurements of the ℓ1 versus ℓ2 geometry of real networks support our theoretical claims, and we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models.
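The scheme the abstract describes (workers send only gradient signs to the server, the server aggregates them by an elementwise majority vote and broadcasts a single sign vector back, so both directions of communication use 1 bit per coordinate) can be illustrated with a minimal sketch. This is an illustration based only on the abstract: the function names, learning rate, and toy grad_fn below are assumptions, not the paper's reference implementation, and the momentum counterpart mentioned at the end of the abstract is not shown.

import numpy as np

def worker_sign_gradient(params, batch, grad_fn):
    # Each worker computes a stochastic gradient on its own batch and
    # transmits only the elementwise sign: 1 bit per coordinate to the server.
    return np.sign(grad_fn(params, batch))

def majority_vote(sign_grads):
    # The parameter server sums the workers' sign vectors and takes the sign
    # of the sum, an elementwise majority vote, so the message it sends back
    # to the workers is again 1 bit per coordinate.
    return np.sign(np.sum(sign_grads, axis=0))

def signsgd_majority_vote_step(params, batches, grad_fn, lr=1e-3):
    # One communication round: workers -> server (sign gradients),
    # server -> workers (majority-vote sign), then a sign-descent update.
    sign_grads = [worker_sign_gradient(params, b, grad_fn) for b in batches]
    return params - lr * majority_vote(sign_grads)

# Toy usage (illustrative only): 5 simulated workers minimising ||params||^2
# with artificially noisy gradients.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad_fn = lambda p, _batch: 2.0 * p + 0.1 * rng.standard_normal(p.shape)
    params = np.ones(10)
    for _ in range(200):
        params = signsgd_majority_vote_step(params, batches=range(5),
                                            grad_fn=grad_fn, lr=0.01)
    print(np.linalg.norm(params))  # should end up close to 0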