Learning in temporally structured environments

ICLR 2023

Abstract
Natural environments have temporal structure at multiple timescales, a property that is reflected in biological learning and memory but typically not in machine learning systems. This paper advances a multiscale learning model in which each weight in a neural network is decomposed into a sum of subweights that learn independently with different learning and decay rates. Knowledge thus becomes distributed across timescales, enabling rapid adaptation to task changes while avoiding catastrophic interference with older knowledge. First, we prove that previous models that learn at multiple timescales, but with complex coupling between timescales, are formally equivalent to the multiscale learner via a reparameterization that eliminates this coupling. The multiscale learner thus offers a unifying framework that is conceptually and computationally simpler than past work. The same analysis also offers a new characterization of momentum learning as a fast weight with a negative learning rate. Second, we derive a model of Bayesian inference in environments governed by $1/f$ noise, a common pattern in both natural and human-generated environments that involves long-range (power-law) autocorrelations. The model works by applying a Kalman filter to jointly infer dynamics at multiple timescales. We then derive a variational approximation to the Bayesian model and show that it is equivalent to the multiscale learner. Third, we evaluate the models on synthetic online prediction tasks characterized by $1/f$ noise in the latent parameters of the environment. We find that the Bayesian model significantly outperforms stochastic gradient descent (which effectively learns at only one timescale) and a batch heuristic that predicts each timestep from a fixed horizon of past observations (motivated by the idea that older data have gone stale). Moreover, the multiscale learner with parameters obtained from the variational approximation performs nearly as well as the full Bayesian model, with memory requirements that are linear in the size of the network (vs. quadratic for the Bayesian model). Future work will incorporate the multiscale learner as an optimizer in deep networks to explore its ability to support learning in rich, temporally structured environments.
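To make the weight decomposition concrete, here is a minimal sketch of the update rule the abstract describes: each weight is a sum of subweights, each decaying toward zero and learning at its own rate, all driven by the gradient with respect to the summed weight. The class name, the number of timescales, and the geometric spacing of the rates are illustrative assumptions, not values from the paper.

```python
import numpy as np

class MultiscaleWeight:
    """Sketch of a multiscale weight: the effective weight is the sum of
    K subweights, each with its own learning rate and decay rate.
    The geometric rate spacing below is an illustrative assumption."""

    def __init__(self, shape, n_scales=4, base_lr=0.1, base_decay=0.5):
        self.sub = [np.zeros(shape) for _ in range(n_scales)]
        # Fast scales: large learning rate, strong decay.
        # Slow scales: small learning rate, weak decay.
        self.lrs = [base_lr * 0.1**k for k in range(n_scales)]
        self.decays = [1.0 - base_decay * 0.1**k for k in range(n_scales)]

    @property
    def value(self):
        # The network's forward pass uses the summed weight.
        return sum(self.sub)

    def step(self, grad):
        # Each subweight updates independently; the gradient is taken
        # with respect to the summed weight, so it is shared across scales.
        for k in range(len(self.sub)):
            self.sub[k] = self.decays[k] * self.sub[k] - self.lrs[k] * grad


# Usage: online prediction of a drifting scalar target.
w = MultiscaleWeight(shape=())
for y in [1.0, 1.2, 0.8, 1.1]:
    grad = 2.0 * (w.value - y)  # gradient of the squared error (w - y)^2
    w.step(grad)
```

Under this scheme the fast subweights track recent task changes while the slow subweights retain older knowledge, which is the mechanism the abstract credits for avoiding catastrophic interference.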
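The abstract describes the Bayesian model only at a high level. One standard way to realize "a Kalman filter that jointly infers dynamics at multiple timescales" is to approximate $1/f$-like dynamics as a sum of AR(1) (Ornstein-Uhlenbeck) components with log-spaced timescales and filter the joint state. The sketch below implements that generic construction; the timescales and noise variances are illustrative assumptions, and the paper's exact state-space formulation may differ.

```python
import numpy as np

def kalman_1f(observations, taus=(1.0, 10.0, 100.0), q=0.5, r=1.0):
    """One-step-ahead prediction with a Kalman filter whose latent state
    is a sum of AR(1) components at log-spaced timescales, a common
    approximation to long-range (1/f-like) autocorrelations."""
    taus = np.asarray(taus, dtype=float)
    a = np.exp(-1.0 / taus)            # AR(1) coefficient per timescale
    K = len(taus)
    A = np.diag(a)                     # state transition
    Q = np.diag(q * (1.0 - a**2))      # process noise (stationary variance q)
    H = np.ones((1, K))                # observation is the sum of components
    m = np.zeros(K)                    # posterior mean of the latent state
    P = np.eye(K) * q                  # posterior covariance
    preds = []
    for y in observations:
        m = A @ m                      # predict step
        P = A @ P @ A.T + Q
        preds.append(float(H @ m))     # one-step-ahead prediction
        S = float(H @ P @ H.T) + r     # innovation variance
        g = (P @ H.T).ravel() / S      # Kalman gain, shape (K,)
        m = m + g * (y - float(H @ m))
        P = P - np.outer(g, H @ P)     # update covariance
    return np.array(preds)
```

Note the quadratic cost the abstract refers to: the filter maintains a full K-by-K covariance over the timescale components, whereas the multiscale learner keeps only K subweights per weight.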
Keywords
1/f noise,Kalman filter,neural network,learning theory,optimizers