Offline Distillation for Robot Lifelong Learning with Imbalanced Experience

ArXiv (2022)

Abstract
Robots will experience non-stationary environment dynamics throughout their lifetime: the robot dynamics can change due to wear and tear, or its surroundings may change over time. Eventually, the robot should perform well in all of the environment variations it has encountered. At the same time, it should still be able to learn fast in a new environment. We investigate two challenges in such a lifelong learning setting. First, existing off-policy algorithms struggle with the trade-off between being conservative enough to maintain good performance in the old environment and learning efficiently in the new environment. We propose the Offline Distillation Pipeline to break this trade-off by separating the training procedure into interleaved phases of online interaction and offline distillation. Second, training with the combined datasets from multiple environments across the lifetime can cause a significant performance drop compared to training on each dataset individually. Our hypothesis is that both the imbalanced quality and size of the datasets exacerbate the extrapolation error of the Q-function during offline training on the "weaker" dataset. We propose a simple fix by keeping the policy closer to the dataset during the distillation phase. In the experiments, we demonstrate these challenges and the proposed solutions with a simulated bipedal robot walking task across various environment changes. We show that the Offline Distillation Pipeline achieves better performance across all the encountered environments without affecting data collection. We also provide a comprehensive empirical study to support our hypothesis on the data imbalance issue.

We study the lifelong learning problem in a simulated bipedal walking task, where the goal is to maximize the forward velocity while avoiding falling. Our experiments involve a small humanoid robot, called OP3, that has 20 actuated joints and has previously been used to train walking directly on hardware (Bloesch et al., 2022). All of the experiments in this work are conducted in simulation, both due to limited access to hardware and for a more controlled experimental setting. However, we try to incorporate the realistic considerations of hardware experiments, so that the approach could be deployed on real robots in the future. All results are averaged over 3 random seeds.
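At a structural level, the pipeline described in the abstract alternates between collecting data online in the current environment and distilling a single policy offline from the combined lifetime datasets, with a regularization weight that keeps the distilled policy close to the data. The sketch below is a minimal, illustrative rendering of that loop under those assumptions; all names (`ToyAgent`, `ReplayDataset`, `ToyEnv`, `run_lifetime`, `bc_weight`) are hypothetical placeholders and not the authors' implementation.

```python
# Structural sketch of an interleaved online-interaction / offline-distillation loop.
# This is an assumption-laden illustration of the idea in the abstract, not the paper's code.
import random
from dataclasses import dataclass, field


@dataclass
class ReplayDataset:
    """Per-environment dataset gathered during an online interaction phase."""
    transitions: list = field(default_factory=list)

    def add(self, transition):
        self.transitions.append(transition)


class ToyAgent:
    """Stand-in for an off-policy agent (Q-function + policy)."""

    def act(self, obs):
        # Placeholder policy: random action in [-1, 1].
        return random.uniform(-1.0, 1.0)

    def update_online(self, dataset):
        # Ordinary off-policy update on the current environment's data would go here.
        pass

    def distill_offline(self, datasets, bc_weight):
        # Offline distillation over the combined lifetime datasets.
        # `bc_weight` stands for the strength of a behavior-regularization term that
        # keeps the policy close to the data, mitigating Q-function extrapolation
        # error on the "weaker" (smaller / lower-quality) dataset.
        combined = [t for d in datasets for t in d.transitions]
        _ = (len(combined), bc_weight)  # gradient updates would go here


class ToyEnv:
    """Trivial 1-D environment standing in for one walking-environment variation."""

    def __init__(self, drift):
        self.drift = drift
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action + self.drift
        reward = -abs(self.state)
        done = abs(self.state) > 5.0
        return self.state, reward, done


def run_lifetime(envs, steps_per_env=200, bc_weight=1.0):
    agent = ToyAgent()
    lifetime_datasets = []
    for env in envs:
        # Phase 1: online interaction in the current environment variation.
        dataset = ReplayDataset()
        obs = env.reset()
        for _ in range(steps_per_env):
            action = agent.act(obs)
            next_obs, reward, done = env.step(action)
            dataset.add((obs, action, reward, next_obs, done))
            agent.update_online(dataset)
            obs = env.reset() if done else next_obs
        lifetime_datasets.append(dataset)
        # Phase 2: offline distillation over all datasets collected so far.
        agent.distill_offline(lifetime_datasets, bc_weight=bc_weight)
    return agent


if __name__ == "__main__":
    run_lifetime([ToyEnv(drift=0.0), ToyEnv(drift=0.1)])
```

The point of the separation is that data collection in each new environment proceeds without conservative constraints, while conservatism (here abstracted as `bc_weight`) is applied only in the offline distillation phase over the imbalanced combined datasets.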