Optimised Grouped-Query Attention Mechanism for Transformers
CoRR (2024)
Abstract
Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the
complexity of multi-head attention (MHA). To convert an MHA into a GQA,
neighbouring query heads in the MHA are evenly split into groups, and each
group shares a single key and value projection. In this work, we propose
AsymGQA, an activation-informed approach that groups an MHA into a GQA
asymmetrically for better model performance. AsymGQA outperforms standard GQA
within the same model-size budget. For example, AsymGQA LLaMA-2-7B achieves an
accuracy increase of 7.5 on MMLU compared to neighbour grouping. Our approach
addresses GQA's trade-off between model performance and hardware efficiency.
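To make the MHA-to-GQA conversion described above concrete, the following minimal Python sketch (not the authors' code) merges per-head key/value projection weights by mean-pooling within each group. The contiguous "neighbour" grouping matches the standard symmetric GQA recipe; the unequal grouping at the end only illustrates the idea of an asymmetric assignment such as AsymGQA's, and the specific group sizes, shapes, and mean-pooling merge are assumptions for illustration rather than the paper's exact procedure.

import torch

def neighbour_groups(num_heads: int, num_groups: int) -> list[list[int]]:
    # Evenly split head indices [0..num_heads) into contiguous groups.
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]

def merge_kv_heads(w_kv: torch.Tensor, groups: list[list[int]]) -> torch.Tensor:
    # Mean-pool per-head K or V projection weights within each group.
    # w_kv: (num_heads, head_dim, d_model) per-head projection weights.
    # Returns (num_groups, head_dim, d_model) shared projections.
    return torch.stack([w_kv[g].mean(dim=0) for g in groups])

# Toy example: 8 query heads, head_dim 16, model dim 128, 4 KV groups.
num_heads, head_dim, d_model = 8, 16, 128
w_k = torch.randn(num_heads, head_dim, d_model)
w_v = torch.randn(num_heads, head_dim, d_model)

# Symmetric neighbour grouping: [[0,1], [2,3], [4,5], [6,7]].
groups = neighbour_groups(num_heads, num_groups=4)
w_k_gqa = merge_kv_heads(w_k, groups)   # shape (4, 16, 128)
w_v_gqa = merge_kv_heads(w_v, groups)

# A hypothetical asymmetric grouping (AsymGQA would instead choose the
# assignment from activation statistics, which is not reproduced here).
asym_groups = [[0], [1, 2, 3], [4, 5], [6, 7]]
w_k_asym = merge_kv_heads(w_k, asym_groups)  # still (4, 16, 128)
print(w_k_gqa.shape, w_k_asym.shape)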