Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
CoRR (2024)
Abstract
We present Inferflow, an efficient and highly configurable inference engine
for large language models (LLMs). With Inferflow, users can serve most common
transformer models by editing a few lines in the corresponding configuration
files, without writing a single line of source code. Inferflow has three key
features that set it apart from most existing inference engines. First, by
implementing a modular framework of atomic building blocks and technologies,
Inferflow is compositionally generalizable to new models. Second, 3.5-bit
quantization is introduced in Inferflow as a tradeoff between 3-bit and 4-bit
quantization. Third, hybrid model partitioning for multi-GPU inference is
introduced in Inferflow to better balance inference speed and throughput than
the existing partition-by-layer and partition-by-tensor strategies.
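
The abstract does not spell out how hybrid partitioning works, so the following is only a minimal sketch of one plausible reading: layers are split into consecutive stages (partition-by-layer), and each stage's tensors are sharded across a small group of GPUs (partition-by-tensor). The names `Placement`, `hybrid_partition`, and `tensor_parallel` are hypothetical and are not taken from Inferflow's code or configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    layer: int    # index of the transformer layer
    gpus: tuple   # GPUs that each hold one tensor shard of this layer


def hybrid_partition(num_layers, num_gpus, tensor_parallel):
    """Assign layers to GPU groups (illustrative assumption, not Inferflow's scheme).

    tensor_parallel is the number of GPUs sharing each layer's tensors.
    tensor_parallel == 1 reduces to pure partition-by-layer;
    tensor_parallel == num_gpus reduces to pure partition-by-tensor.
    """
    if tensor_parallel < 1 or num_gpus % tensor_parallel != 0:
        raise ValueError("tensor_parallel must be a positive divisor of num_gpus")
    num_stages = num_gpus // tensor_parallel
    placements = []
    for layer in range(num_layers):
        # Spread layers as evenly as possible over consecutive stages.
        stage = layer * num_stages // num_layers
        first_gpu = stage * tensor_parallel
        gpus = tuple(range(first_gpu, first_gpu + tensor_parallel))
        placements.append(Placement(layer, gpus))
    return placements


if __name__ == "__main__":
    # 8 layers on 4 GPUs, with each layer sharded across 2 GPUs.
    for p in hybrid_partition(num_layers=8, num_gpus=4, tensor_parallel=2):
        print(p)
```

Under this reading, the size of each GPU group is the knob that moves between the two existing strategies, which is one way the speed/throughput balance mentioned above could be tuned.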