Length Generalization of Causal Transformers without Position Encoding
arXiv (2024)
Abstract
Generalizing to longer sentences is important for recent Transformer-based
language models. Besides algorithms manipulating explicit position features,
the success of Transformers without position encodings (NoPE) provides a new
way to overcome the challenge. In this paper, we study the length
generalization property of NoPE. We find that although NoPE can extend to
longer sequences than the commonly used explicit position encodings, it still
has a limited context length. We identify a connection between the failure of
NoPE's generalization and the distraction of attention distributions. We
propose a parameter-efficient tuning method that searches for each attention
head's best temperature hyper-parameter, which substantially expands NoPE's
context size. Experiments on long-sequence language modeling, the synthetic
passkey retrieval task, and real-world long-context tasks show that NoPE can
achieve performance competitive with state-of-the-art length generalization
algorithms. The source code is publicly accessible.
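To make the temperature mechanism concrete, here is a minimal sketch of causal attention without position encodings in which each head's logits are divided by a learnable per-head temperature before the softmax. This is an illustration of the idea described in the abstract, not the authors' released implementation; all names (e.g., `TemperatureScaledAttention`, `log_temp`) are hypothetical.

```python
# Sketch only: per-head temperature scaling in NoPE-style causal attention.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaledAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learnable temperature per head, stored in log-space so the
        # effective temperature exp(log_temp) stays positive. Tuning only
        # these n_heads scalars is what makes the search parameter-efficient.
        self.log_temp = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head); note that no position
        # encoding is added anywhere (NoPE).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Divide each head's logits by its temperature: values below 1
        # sharpen the attention distribution, counteracting the attention
        # "distraction" (overly flat distributions) the paper links to
        # failures on sequences longer than those seen in training.
        temp = torch.exp(self.log_temp).view(1, self.n_heads, 1, 1)
        scores = scores / temp
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))
```

Under this reading, extending the context window means freezing the pretrained weights and optimizing only the `log_temp` parameters on long sequences, which is a tiny fraction of the model's parameter count.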