Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
CoRR (2024)
Abstract
Many computational factors limit broader deployment of large language models.
In this paper, we focus on a memory bottleneck imposed by the key-value (KV)
cache, a computational shortcut that requires storing previous KV pairs during
decoding. While existing KV cache methods approach this problem by pruning or
evicting large swaths of relatively less important KV pairs to dramatically
reduce the memory footprint of the cache, they can have limited success in
tasks that require recollecting a majority of previous tokens. To alleviate
this issue, we propose LESS, a simple integration of a (nearly free)
constant-sized cache with eviction-based cache methods, such that all tokens
can be queried at later decoding steps. Its ability to retain information
over time proves valuable on a variety of tasks, where we demonstrate that
LESS can narrow the performance gap relative to caching everything, sometimes
even matching it, all while remaining efficient.
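
The abstract describes LESS only at a high level. As a rough illustration of the idea, below is a minimal Python sketch of a decoding-time cache that pairs an eviction policy with a constant-sized recurrent state, so that evicted tokens still contribute to attention. The names (`LESSCache`, `phi`), the FIFO eviction rule, and the ELU+1 feature map are illustrative assumptions: the paper learns small kernel networks and composes with stronger eviction policies, so this is a sketch of the general recipe, not the authors' implementation.

```python
import numpy as np

def phi(x):
    # Illustrative kernel feature map (ELU + 1, as used in linear
    # attention); LESS itself learns this map. Positivity keeps the
    # normalizer well behaved.
    return np.where(x > 0, x + 1.0, np.exp(x))

class LESSCache:
    """Sketch: exact KV cache under an eviction budget, plus a
    constant-sized recurrent state absorbing the evicted pairs."""

    def __init__(self, d, budget):
        self.budget = budget          # max exact KV pairs retained
        self.keys, self.values = [], []
        self.H = np.zeros((d, d))     # state: sum of phi(k) v^T over evicted
        self.z = np.zeros(d)          # normalizer: sum of phi(k) over evicted

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.budget:
            # Evict per some importance policy (FIFO here for brevity);
            # instead of discarding the pair, fold it into the state.
            k_old, v_old = self.keys.pop(0), self.values.pop(0)
            f = phi(k_old)
            self.H += np.outer(f, v_old)
            self.z += f

    def attend(self, q, scale):
        # Linear-attention readout of evicted tokens ...
        fq = phi(q)
        num = self.H.T @ fq
        den = fq @ self.z
        # ... combined, under a shared normalizer, with exact softmax
        # attention over the retained cache, so every past token is
        # still queryable.
        if self.keys:
            K, V = np.stack(self.keys), np.stack(self.values)
            s = np.exp(K @ q * scale)
            num = num + V.T @ s
            den = den + s.sum()
        return num / den

# Toy usage: 64-dim heads, only 8 exact KV pairs kept at any time.
rng = np.random.default_rng(0)
cache = LESSCache(d=64, budget=8)
for _ in range(32):
    cache.append(rng.standard_normal(64), rng.standard_normal(64))
out = cache.attend(rng.standard_normal(64), scale=64 ** -0.5)
```

The memory cost of `H` and `z` is fixed regardless of sequence length, which is why the recurrent state is described as nearly free relative to storing every evicted KV pair.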