AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
CoRR (2024)
Abstract
The choice of batch sizes in stochastic gradient optimizers is critical for
model training. However, the practice of varying batch sizes throughout the
training process is less explored compared to other hyperparameters. We
investigate adaptive batch size strategies derived from adaptive sampling
methods, traditionally applied only in stochastic gradient descent. Given the
significant interplay between learning rates and batch sizes, and considering
the prevalence of adaptive gradient methods in deep learning, we emphasize the
need for adaptive batch size strategies in these contexts. We introduce
AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase
batch sizes during training, while model updates are performed using AdaGrad
and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a
rate of 𝒪(1/K) for finding a first-order stationary point of smooth
nonconvex functions within K iterations. AdaGrad also demonstrates similar
convergence properties when integrated with a novel coordinate-wise variant of
our adaptive batch size strategies. Our theoretical claims are supported by
numerical experiments on various image classification tasks, highlighting the
enhanced adaptability of progressive batching protocols in deep learning and
the potential of such adaptive batch size strategies with adaptive gradient
optimizers in large-scale model training.
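To make the idea concrete, the following is a minimal sketch (not the authors' code) of progressive batching combined with an AdaGradNorm-style update on a toy least-squares problem. The batch-growth rule below is a norm-test-style heuristic assumed for illustration, and the constants eta, theta, and the initial batch size are arbitrary; the paper's actual criteria and analysis differ in detail.

import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 10                      # number of samples, number of parameters
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def sample_grad(x, idx):
    """Mini-batch gradient of 0.5 * ||A x - y||^2 / n over the rows in idx."""
    r = A[idx] @ x - y[idx]
    return A[idx].T @ r / len(idx)

x = np.zeros(d)
eta, theta = 1.0, 0.9                # step size and test parameter (assumed values)
b, b_max = 8, n                      # initial and maximum batch sizes (assumed values)
accum = 0.0                          # AdaGradNorm scalar accumulator

for k in range(200):
    idx = rng.choice(n, size=b, replace=False)
    per_sample = np.stack([sample_grad(x, [i]) for i in idx])
    g = per_sample.mean(axis=0)

    # Norm-test-style check (assumed): if the variance of the mini-batch
    # gradient estimate is large relative to its squared norm, grow the batch.
    var_of_mean = per_sample.var(axis=0).sum() / b
    if var_of_mean > (theta ** 2) * np.dot(g, g) and b < b_max:
        b = min(2 * b, b_max)

    # AdaGradNorm update: one scalar step size shared by all coordinates,
    # scaled by the accumulated squared gradient norms.
    accum += np.dot(g, g)
    x -= eta / np.sqrt(accum + 1e-12) * g

print("batch size at exit:", b,
      "| full-gradient norm:", np.linalg.norm(sample_grad(x, np.arange(n))))

The coordinate-wise variant mentioned in the abstract would instead keep a per-coordinate accumulator (AdaGrad proper) and apply a coordinate-wise batch-size test; the sketch above only illustrates the scalar (AdaGradNorm) case.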