Cached Transformers: Improving Transformers with Differentiable Memory Cache

AAAI 2024 (2024)

Abstract
This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing long-range dependencies to be explored. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and demonstrates that it can be applied to a broader range of scenarios.
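The abstract describes GRC attention at a high level: a fixed-size token cache is updated by a recurrent sigmoid gate, and each layer attends to both the current tokens and the cached tokens before blending the two results. Below is a minimal, illustrative PyTorch sketch of that idea, assuming a simple interpolation-based token summary, a learned scalar mixing ratio, and standard multi-head attention; the class and parameter names (GRCAttentionSketch, cache_len, mix_ratio, etc.) are hypothetical and are not taken from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRCAttentionSketch(nn.Module):
    """Illustrative sketch of gated-recurrent-cache attention.

    A fixed-length token cache is updated by a sigmoid gate on every
    forward pass; attention is computed over both the current tokens and
    the cached tokens, then mixed by a learned ratio. All names and
    design choices here are assumptions, not the paper's implementation.
    """

    def __init__(self, embed_dim: int, num_heads: int, cache_len: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.gate_proj = nn.Linear(2 * embed_dim, embed_dim)  # recurrent update gate
        self.mix_ratio = nn.Parameter(torch.zeros(1))         # learned blend of the two streams
        self.cache_len = cache_len
        self.register_buffer("cache", None, persistent=False)

    def _summarize(self, x: torch.Tensor) -> torch.Tensor:
        # Compress current tokens to the cache length (simple interpolation here).
        return F.interpolate(x.transpose(1, 2), size=self.cache_len).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        summary = self._summarize(x)
        if self.cache is None or self.cache.shape[0] != x.shape[0]:
            cache = summary.detach()          # (re)initialize cache from current batch
        else:
            cache = self.cache
        # Recurrent gating: interpolate between the old cache and the new token summary.
        g = torch.sigmoid(self.gate_proj(torch.cat([cache, summary], dim=-1)))
        new_cache = (1.0 - g) * cache + g * summary
        self.cache = new_cache.detach()       # carried across iterations

        attn_self, _ = self.self_attn(x, x, x)                    # attention over current tokens
        attn_cache, _ = self.cache_attn(x, new_cache, new_cache)  # attention over cached tokens
        lam = torch.sigmoid(self.mix_ratio)
        return lam * attn_cache + (1.0 - lam) * attn_self


if __name__ == "__main__":
    layer = GRCAttentionSketch(embed_dim=64, num_heads=4, cache_len=16)
    tokens = torch.randn(2, 50, 64)  # (batch, sequence, embedding)
    print(layer(tokens).shape)       # torch.Size([2, 50, 64])
```

The cached stream gives each query access to tokens beyond the current sequence, which is how the abstract's enlarged receptive field would arise under these assumptions.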
Keywords
ML: Transfer, Domain Adaptation, Multi-Task Learning; ML: Deep Learning Algorithms; ML: Transparent, Interpretable, Explainable ML; ML: Unsupervised & Self-Supervised Learning