Data, Data Everywhere: A Guide for Pretraining Dataset Construction
arXiv (2024)
Abstract
The impressive capabilities of recent language models can be largely
attributed to the multi-trillion token pretraining datasets that they are
trained on. However, model developers fail to disclose their construction
methodology, which has led to a lack of open information on how to develop
effective pretraining sets. To address this issue, we perform the first
systematic study across the entire pipeline of pretraining set construction.
First, we run ablations on existing techniques for pretraining set development
to identify which methods translate to the largest gains in model accuracy on
downstream evaluations. Then, we categorize the most widely used data source,
web crawl snapshots, across the attributes of toxicity, quality, type of
speech, and domain. Finally, we show how such attribute information can be used
to further refine and improve the quality of a pretraining set. These findings
constitute an actionable set of steps that practitioners can use to develop
high-quality pretraining sets.
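To make the attribute-based refinement step concrete, below is a minimal sketch of how per-document attribute labels (toxicity, quality, type of speech, domain) could be used to filter a candidate web-crawl pool. The attribute names, thresholds, and record layout here are illustrative assumptions, not the paper's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class WebDocument:
    text: str
    toxicity: float    # classifier score in [0, 1]; higher = more toxic
    quality: float     # classifier score in [0, 1]; higher = better
    speech_type: str   # e.g. "news", "dialogue", "boilerplate"
    domain: str        # e.g. "science", "sports"

def refine_pretraining_set(docs, max_toxicity=0.2, min_quality=0.6,
                           excluded_types=("boilerplate",)):
    """Keep only documents whose attribute labels pass the thresholds.

    Thresholds are hypothetical placeholders; in practice they would be
    tuned via the downstream-evaluation ablations the abstract describes.
    """
    return [
        d for d in docs
        if d.toxicity <= max_toxicity
        and d.quality >= min_quality
        and d.speech_type not in excluded_types
    ]

# Usage: tag documents upstream with attribute classifiers, then refine
# the pool before tokenization.
pool = [
    WebDocument("A peer-reviewed summary ...", 0.01, 0.9, "news", "science"),
    WebDocument("click here click here ...", 0.05, 0.1, "boilerplate", "ads"),
]
print(len(refine_pretraining_set(pool)))  # -> 1
```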