CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers
arXiv (2024)
Abstract
The fast-growing large-scale language models are delivering unprecedented
performance on almost all natural language processing tasks. However, the
effectiveness of large language models relies on an exponentially
increasing number of parameters. The overwhelming computational complexity incurs
a high inference latency that negatively affects user experience. Existing
methods to improve inference efficiency, such as tensor parallelism and
quantization, aim to reduce per-layer computation latency, yet overlook the
cumulative latency that grows with the number of layers. Recent works reduce the
cumulative latency through layer removal, but they lead to a significant
performance drop. Motivated by the similarity of inputs among adjacent layers,
we propose to identify quasi-independent layers, which can be concurrently
computed to significantly decrease inference latency. We also introduce a
bypassing technique to mitigate the effect of information loss. Empirical
experiments with the proposed approach on the LLaMA models confirm that
Concurrent Computation of Quasi-Independent Layers (CQIL) reduces latency by
up to 48.3%, while maintaining a close level of performance.
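The core idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code; `cqil_forward`, the toy residual updates, and the grouping scheme are all hypothetical. It assumes a pre-norm residual architecture where each layer contributes an additive update `f(x)`, so grouped ("quasi-independent") layers can all read the same input and have their updates summed, which is what would allow them to run concurrently on separate devices.

```python
def cqil_forward(x, layer_fns, groups):
    """Hypothetical sketch of CQIL-style grouped execution.

    layer_fns: per-layer residual updates f_i; a standard sequential
               block computes x = x + f_i(x) one layer at a time.
    groups:    list of lists of layer indices. Layers in the same group
               are treated as quasi-independent: they all receive the
               same input and their residual updates are summed, so in
               a real deployment they could be computed concurrently.
    """
    for group in groups:
        x = x + sum(layer_fns[i](x) for i in group)
    return x


# Toy scalar residual updates standing in for transformer layers.
fns = [lambda v: 0.1 * v, lambda v: 0.2 * v, lambda v: 0.3 * v]

# Sequential baseline: every layer in its own group.
seq = cqil_forward(1.0, fns, [[0], [1], [2]])
# Layers 1 and 2 grouped: both read the output of layer 0.
par = cqil_forward(1.0, fns, [[0], [1, 2]])
```

The grouped result differs slightly from the sequential one (layer 2 no longer sees layer 1's update), which is the approximation error the paper's bypassing technique is meant to mitigate.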