Language models scale reliably with over-training and on downstream tasks
arXiv (2024)
Abstract
Scaling laws are useful guides for derisking expensive training runs, as they
predict performance of large models using cheaper, small-scale experiments.
However, there remain gaps between current scaling studies and how language
models are ultimately trained and evaluated. For instance, scaling is usually
studied in the compute-optimal training regime (i.e., "Chinchilla optimal"
regime). In contrast, models are often over-trained to reduce inference costs.
Moreover, scaling laws mostly predict loss on next-token prediction, but models
are usually compared on downstream task performance. To address both
shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters
trained with various numbers of tokens on three data distributions. First, we
fit scaling laws that extrapolate in both the amount of over-training and the
number of model parameters. This enables us to predict the validation loss of a
1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B
parameter, 138B token run (i.e., a compute-optimal run), each
from experiments that take 300× less compute. Second, we relate the
perplexity of a language model to its downstream task performance by proposing
a power law. We use this law to predict top-1 error averaged over downstream
tasks for the two aforementioned models, using experiments that take 20×
less compute. Our experiments are available at
https://github.com/mlfoundations/scaling.
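
The "32× over-trained" and "compute-optimal" labels above can be checked with simple arithmetic. The sketch below is illustrative only and rests on two assumptions not stated in the abstract: that roughly 20 training tokens per parameter approximates the Chinchilla-optimal ratio, and that training compute follows the common rule of thumb C ≈ 6ND. The function names are hypothetical and are not taken from the authors' repository.

```python
# Assumption: about 20 tokens per parameter approximates the
# compute-optimal ("Chinchilla optimal") ratio.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def token_multiplier(params: float, tokens: float) -> float:
    """Training tokens per model parameter."""
    return tokens / params


def over_training_factor(params: float, tokens: float,
                         optimal: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """How many times more tokens per parameter than the assumed optimum."""
    return token_multiplier(params, tokens) / optimal


def approx_train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute, C ~ 6 * N * D (an assumption here)."""
    return 6.0 * params * tokens


# The two target runs named in the abstract: (parameters, tokens).
runs = {
    "1.4B params, 900B tokens": (1.4e9, 900e9),
    "6.9B params, 138B tokens": (6.9e9, 138e9),
}

for name, (n, d) in runs.items():
    print(f"{name}: "
          f"{token_multiplier(n, d):.0f} tokens/param, "
          f"{over_training_factor(n, d):.1f}x the assumed optimum, "
          f"~{approx_train_flops(n, d):.2e} FLOPs")
```

Under these assumptions the 1.4B run comes out at roughly 32 times the assumed optimal token-to-parameter ratio, and the 6.9B run sits at the ratio itself, which matches the labels used in the abstract.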