Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective
CoRR (2024)
Abstract
Adaptive gradient optimizers like Adam(W) are the default training algorithms
for many deep learning architectures, such as transformers. Their diagonal
preconditioner is based on the gradient outer product which is incorporated
into the parameter update via a square root. While these methods are often
motivated as approximate second-order methods, the square root represents a
fundamental difference. In this work, we investigate how the behavior of
adaptive methods changes when we remove the root, i.e., strengthen their
second-order motivation. Surprisingly, we find that such square-root-free
adaptive methods close the generalization gap to SGD on convolutional
architectures, while maintaining their root-based counterparts' performance on
transformers. The second-order perspective also has practical benefits for the
development of adaptive methods with non-diagonal preconditioners. In contrast
to root-based counterparts like Shampoo, they do not require numerically
unstable matrix square roots and therefore work well in low precision, which we
demonstrate empirically. This raises important questions regarding the
currently overlooked role of adaptivity for the success of adaptive methods.
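To make the abstract's central idea concrete, here is a minimal sketch of a diagonal-preconditioned update in the style of RMSprop/Adam, with a flag toggling the square root on the second-moment accumulator. The function name, hyperparameter values, and the exact form of the root-free denominator are illustrative assumptions, not the paper's specific algorithm.

```python
import numpy as np

def diagonal_update(g, v, lr=1e-3, beta2=0.999, eps=1e-8, square_root=True):
    """One illustrative step of a diagonally preconditioned update.

    With square_root=True, the denominator is sqrt(v) + eps, resembling
    RMSprop/Adam. With square_root=False, the root is removed, dividing by
    v + eps directly -- a toy stand-in for the square-root-free variants the
    paper studies (its actual methods may differ in detail).
    """
    # Exponential moving average of the squared gradient
    # (the diagonal of the gradient outer product).
    v = beta2 * v + (1 - beta2) * g**2
    denom = np.sqrt(v) + eps if square_root else v + eps
    step = lr * g / denom
    return step, v

g = np.array([0.1, -0.2])
v0 = np.zeros_like(g)
step_root, _ = diagonal_update(g, v0, square_root=True)
step_free, _ = diagonal_update(g, v0, square_root=False)
```

Note that removing the root changes the effective scaling of the step (here, early steps become much larger because `v` is small), so square-root-free methods generally require a different learning-rate scale.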