Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
arXiv (2024)
Abstract
Scaling laws describe the relationship between the size of language models
and their capabilities. Unlike prior studies that evaluate a model's capability
via loss or benchmarks, we estimate the number of knowledge bits a model
stores. We focus on factual knowledge represented as tuples, such as (USA,
capital, Washington D.C.) from a Wikipedia page. Through multiple controlled
datasets, we establish that language models can store, and can only store, 2 bits of
knowledge per parameter, even when quantized to int8, and such knowledge can be
flexibly extracted for downstream applications. Consequently, a 7B model can
store 14B bits of knowledge which, by our estimation, surpasses the knowledge in
English Wikipedia and textbooks combined.
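As a rough illustration of that arithmetic, here is a minimal sketch assuming the reported 2 bits/parameter figure; the helper name and constant below are ours for illustration, not the paper's code.

```python
# Illustrative sketch (not the paper's code): translate the reported
# 2 bits/parameter figure into a total knowledge capacity in bits.
BITS_PER_PARAM = 2  # capacity ratio reported in the abstract


def knowledge_capacity_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated knowledge capacity, in bits, for a model of a given size."""
    return num_params * bits_per_param


# A 7B-parameter model: 7e9 params * 2 bits/param = 14B bits.
print(knowledge_capacity_bits(7_000_000_000))  # -> 14000000000
```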
More broadly, we present 12 results on how (1) training duration, (2) model
architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5)
data signal-to-noise ratio affect a model's knowledge storage capacity. Notable
insights include:
* The GPT-2 architecture, with rotary embedding, matches or even surpasses
LLaMA/Mistral architectures in knowledge storage, particularly over shorter
training durations. This arises because LLaMA/Mistral uses GatedMLP, which is
less stable and harder to train (see the feed-forward sketch after this list).
* Prepending training data with domain names (e.g., wikipedia.org)
significantly increases a model's knowledge capacity. Language models can
autonomously identify and prioritize domains rich in knowledge, optimizing
their storage capacity (see the data-prefixing sketch after this list).
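To make the GatedMLP contrast in the first bullet concrete, below is a minimal PyTorch-style sketch of the two feed-forward blocks: a standard GPT-2-style MLP versus a LLaMA/Mistral-style gated MLP. Dimensions, bias choices, and activations follow common practice and are assumptions, not the paper's exact configurations.

```python
import torch
import torch.nn as nn


class StandardMLP(nn.Module):
    """GPT-2-style feed-forward block: up-project, nonlinearity, down-project."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(torch.nn.functional.gelu(self.up(x)))


class GatedMLP(nn.Module):
    """LLaMA/Mistral-style gated feed-forward block (SwiGLU-like): a gate
    projection multiplies the up projection elementwise; the abstract notes
    this variant is less stable and harder to train over short durations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(torch.nn.functional.silu(self.gate(x)) * self.up(x))
```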
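And a minimal sketch of the data-prefixing idea from the second bullet: prefix each training document with its source domain before tokenization. The separator, field names, and example records here are assumptions for illustration, not the paper's preprocessing code.

```python
def prepend_domain(doc_text: str, domain: str) -> str:
    # Prefix the raw training text with its source domain so the model can
    # learn to associate domains with how knowledge-rich their text tends to be.
    return f"{domain}\n{doc_text}"


corpus = [
    {"domain": "wikipedia.org", "text": "Washington D.C. is the capital of the USA."},
    {"domain": "example-forum.net", "text": "imo the best capital is ..."},
]
training_texts = [prepend_domain(d["text"], d["domain"]) for d in corpus]
```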