Prompt-prompted Mixture of Experts for Efficient LLM Generation
arXiv (2024)
Abstract
With the development of transformer-based large language models (LLMs), they
have been applied to many fields due to their remarkable utility, but this
comes at a considerable computational cost at deployment. Fortunately, some
methods such as pruning or constructing a mixture of experts (MoE) aim at
exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in
speed and reduction in memory requirements. However, these techniques can be
very costly and inflexible in practice, as they often require training or are
restricted to specific types of architectures. To address this, we introduce
GRIFFIN, a novel training-free MoE that selects unique FF experts at the
sequence level for efficient generation across a plethora of LLMs with
different non-ReLU activation functions. This is possible due to a critical
observation that many trained LLMs naturally produce highly structured FF
activation patterns within a sequence, which we call flocking. Despite our
method's simplicity, we show that with 50% of the FF parameters, GRIFFIN maintains
the original model's performance with little to no degradation on a variety of
classification and generation tasks, all while improving latency (e.g.
1.25× speed-up in Llama 2 13B on an NVIDIA L40). Code will be available
at https://github.com/hdong920/GRIFFIN.
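
The abstract only sketches the mechanism, so the following is a minimal illustrative sketch of the kind of sequence-level FF expert selection it describes, assuming a Llama-style gated FF block. The names here (select_ff_experts, gate_proj_w, up_proj_w, keep_frac) and the aggregation rule are assumptions for illustration, not the authors' implementation; see the repository above for the official code.

```python
import torch
import torch.nn.functional as F

def select_ff_experts(gate_proj_w, up_proj_w, hidden_states, keep_frac=0.5):
    """Illustrative sketch (hypothetical helper): pick a per-sequence subset of
    FF neurons ("experts") from the prompt's activation statistics.

    gate_proj_w, up_proj_w: [d_ff, d_model] weights of a gated FF block
    hidden_states: [seq_len, d_model] prompt hidden states entering the FF block
    """
    # Gated activation, e.g. SiLU-gated as in Llama-style FF blocks
    acts = F.silu(hidden_states @ gate_proj_w.T) * (hidden_states @ up_proj_w.T)  # [seq_len, d_ff]
    # Aggregate each neuron's activity over the prompt; the "flocking" observation
    # suggests these per-sequence statistics are highly structured.
    scores = acts.norm(dim=0)                      # [d_ff]
    k = int(keep_frac * scores.numel())
    expert_idx = torch.topk(scores, k).indices     # FF neurons to keep for this sequence
    return expert_idx

# During generation, only the selected rows/columns of the FF projections would
# be used, reducing FF compute to roughly keep_frac of the original.
```

In a sketch like this, selection happens once per sequence from the prompt's activations, and subsequent decoding steps reuse the reduced FF projections, which is the kind of per-sequence pruning that could yield the latency gains reported above.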