On Pretraining Data Diversity for Self-Supervised Learning
arxiv(2024)
Abstract
We explore the impact of training with more diverse datasets, characterized
by the number of unique samples, on the performance of self-supervised learning
(SSL) under a fixed computational budget. Our findings consistently demonstrate
that increasing pretraining data diversity enhances SSL performance, albeit
only when the distribution distance to the downstream data is minimal. Notably,
even with exceptionally large pretraining data diversity, achieved through
methods such as web crawling or diffusion-generated data, distribution shift
remains a challenge. Our experiments are comprehensive, covering seven SSL
methods and large-scale datasets such as ImageNet and YFCC100M, and amounting
to over 200 GPU-days of compute. Code and trained models will be available at
https://github.com/hammoudhasan/DiversitySSL .