WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
CoRR (2024)
Abstract
This paper presents WanJuan-CC, a safe and high-quality open-sourced English
webtext dataset derived from Common Crawl data. The study addresses the
challenges of constructing large-scale pre-training datasets for language
models, which require vast amounts of high-quality data. A comprehensive
process was designed to handle Common Crawl data, including extraction,
heuristic rule filtering, fuzzy deduplication, content safety filtering, and
data quality filtering. From approximately 68 billion original English
documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of
high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from
this dataset. The paper also provides statistical information related to data
quality, enabling users to select appropriate data according to their needs. To
evaluate the quality and utility of the dataset, we trained 1B-parameter and
3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results
show that WanJuan-CC performs better on validation datasets and downstream
tasks.
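
The order of the processing stages named in the abstract (extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, data quality filtering) can be illustrated with a toy sketch. The abstract does not specify how each stage is implemented, so everything below is an assumption made for illustration: the function names, the regex-based extraction, the word-count rule, the shingle-based Jaccard comparison, the keyword blocklist, and the character-ratio quality score are simple stand-ins, not the paper's actual filters.

```python
import re
from typing import Iterable, List, Set, Tuple


def extract_text(html: str) -> str:
    """Toy extraction (assumption): drop tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def passes_heuristics(doc: str, min_words: int = 5) -> bool:
    """Toy heuristic rule (assumption): require a minimum word count."""
    return len(doc.split()) >= min_words


def shingles(doc: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams used to compare documents."""
    words = doc.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuzzy_dedup(docs: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not too similar to one already kept.

    This is a quadratic pairwise comparison, fine for a demo only.
    """
    kept: List[str] = []
    kept_shingles: List[set] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def is_safe(doc: str, blocklist: Tuple[str, ...] = ("badword",)) -> bool:
    """Toy safety filter (assumption): reject documents with blocked terms."""
    lowered = doc.lower()
    return not any(term in lowered for term in blocklist)


def quality_score(doc: str) -> float:
    """Toy quality score (assumption): share of alphabetic characters."""
    chars = [c for c in doc if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0


def run_pipeline(raw_pages: Iterable[str], quality_threshold: float = 0.7) -> List[str]:
    """Apply the five stages in the order listed in the abstract."""
    docs = [extract_text(p) for p in raw_pages]
    docs = [d for d in docs if passes_heuristics(d)]
    docs = fuzzy_dedup(docs)
    docs = [d for d in docs if is_safe(d)]
    return [d for d in docs if quality_score(d) >= quality_threshold]
```

In this sketch the cheap rule-based stages run before the pairwise comparison, so fewer documents reach the more expensive steps. A pipeline at Common Crawl scale would need an approximate, sub-quadratic deduplication method rather than the pairwise loop shown here.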