Provably learning a multi-head attention layer
CoRR (2024)
Abstract
The multi-head attention layer is one of the key components of the
transformer architecture that sets it apart from traditional feed-forward
models. Given a sequence length k, attention matrices
Θ_1, …, Θ_m ∈ ℝ^{d×d}, and projection matrices
𝐖_1, …, 𝐖_m ∈ ℝ^{d×d}, the corresponding multi-head attention layer
F : ℝ^{k×d} → ℝ^{k×d} transforms length-k sequences of d-dimensional
tokens 𝐗 ∈ ℝ^{k×d} via

F(𝐗) ≜ ∑_{i=1}^{m} softmax(𝐗 Θ_i 𝐗^⊤) 𝐗 𝐖_i.
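The layer F defined above can be sketched directly in NumPy (a minimal illustration; the function names are ours, and we assume the standard convention that softmax is applied to each row of 𝐗 Θ_i 𝐗^⊤):

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax, with max-subtraction for numerical stability."""
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Thetas, Ws):
    """Compute F(X) = sum_i softmax(X Theta_i X^T) X W_i.

    X: (k, d) sequence of tokens; Thetas, Ws: lists of m (d, d) matrices.
    Returns a (k, d) array.
    """
    return sum(softmax_rows(X @ Theta @ X.T) @ X @ W
               for Theta, W in zip(Thetas, Ws))

# Example with Boolean {+1, -1} inputs, mirroring the learning setup.
rng = np.random.default_rng(0)
k, d, m = 4, 3, 2
X = rng.choice([-1.0, 1.0], size=(k, d))
Thetas = [rng.standard_normal((d, d)) for _ in range(m)]
Ws = [rng.standard_normal((d, d)) for _ in range(m)]
Y = multi_head_attention(X, Thetas, Ws)
print(Y.shape)  # (4, 3)
```

Each head's attention matrix softmax(𝐗 Θ_i 𝐗^⊤) is row-stochastic, so each output token is a sum over heads of convex combinations of the rows of 𝐗 𝐖_i.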
In this work, we initiate the study of provably learning a multi-head attention
layer from random examples and give the first nontrivial upper and lower bounds
for this problem:
- Provided {𝐖_i, Θ_i} satisfy certain non-degeneracy conditions, we give a
(dk)^{O(m³)}-time algorithm that learns F to small error given random labeled
examples drawn uniformly from {±1}^{k×d}.
- We prove computational lower bounds showing that in the worst case,
exponential dependence on m is unavoidable.
We focus on Boolean 𝐗 to mimic the discrete nature of tokens in
large language models, though our techniques naturally extend to standard
continuous settings, e.g. Gaussian. Our algorithm, which is centered around
using examples to sculpt a convex body containing the unknown parameters, is a
significant departure from existing provable algorithms for learning
feedforward networks, which predominantly exploit algebraic and rotation
invariance properties of the Gaussian distribution. In contrast, our analysis
is more flexible as it primarily relies on various upper and lower tail bounds
for the input distribution and "slices" thereof.