From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
CoRR (2024)
Abstract
Modern language models rely on the transformer architecture and attention
mechanism to perform language understanding and text generation. In this work,
we study learning a 1-layer self-attention model from a set of prompts and
associated output data sampled from the model. We first establish a precise
mapping between the self-attention mechanism and Markov models: Inputting a
prompt to the model samples the output token according to a context-conditioned
Markov chain (CCMC) which weights the transition matrix of a base Markov chain.
Additionally, incorporating positional encoding results in position-dependent
scaling of the transition probabilities. Building on this formalism, we develop
identifiability/coverage conditions for the prompt distribution that guarantee
consistent estimation and establish sample complexity guarantees under IID
samples. Finally, we study the problem of learning from a single output
trajectory generated from an initial prompt. We characterize an intriguing
winner-takes-all phenomenon where the generative process implemented by
self-attention collapses into sampling a limited subset of tokens due to its
non-mixing nature. This provides a mathematical explanation for the tendency of
modern LLMs to generate repetitive text. In summary, the equivalence to CCMC
provides a simple but powerful framework to study self-attention and its
properties.
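
To make the CCMC picture concrete, here is a minimal numerical sketch, not the paper's exact parameterization: a base transition matrix stands in for the attention-induced Markov chain, and the next-token distribution reweights the last prompt token's transition row by how often each token appears in the context (the paper's weighting is attention-based and, with positional encoding, position-dependent; the count-based weighting, the vocabulary size, and the helper names `ccmc_distribution` and `generate` are illustrative assumptions). Iterating the sampler also hints at the winner-takes-all effect: generation stays within, and concentrates on, tokens already present in the context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for the attention-induced base Markov chain: a row-stochastic
# matrix P_base over a small vocabulary of K tokens.
K = 5
logits = rng.normal(size=(K, K))
P_base = np.exp(logits)
P_base /= P_base.sum(axis=1, keepdims=True)

def ccmc_distribution(prompt, P_base):
    """Context-conditioned next-token distribution (illustrative form):
    the base transition row of the last prompt token, weighted by the
    empirical frequency of each token in the prompt, then renormalized."""
    K = P_base.shape[0]
    counts = np.bincount(prompt, minlength=K)   # context token frequencies
    weights = P_base[prompt[-1]] * counts       # context-conditioned reweighting
    return weights / weights.sum()

def generate(prompt, P_base, steps):
    """Autoregressive sampling from the CCMC: each sampled token is appended to
    the context, which further concentrates probability mass on already-seen
    tokens (a toy version of the non-mixing, winner-takes-all behavior)."""
    context = list(prompt)
    for _ in range(steps):
        p = ccmc_distribution(np.array(context), P_base)
        context.append(int(rng.choice(len(p), p=p)))
    return context

print(generate([0, 2, 1, 2], P_base, steps=20))
```

Because unseen tokens get zero weight in this sketch, every continuation is drawn from the prompt's own token set, which is why the trajectory collapses onto a limited subset of tokens rather than mixing over the full vocabulary.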