Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
CoRR (2024)
Abstract
Imbalanced data and spurious correlations are common challenges in machine
learning and data science. Oversampling, which artificially increases the
number of instances in the underrepresented classes, has been widely adopted to
tackle these challenges. In this article, we introduce OPAL (OversamPling with
Artificial LLM-generated data), a systematic oversampling approach that
leverages the capabilities of
large language models (LLMs) to generate high-quality synthetic data for
minority groups. Recent studies on synthetic data generation using deep
generative models mostly target prediction tasks. Our proposal differs in that
we focus on handling imbalanced data and spurious correlations. More
importantly, we develop a novel theory that rigorously characterizes the
benefits of using synthetic data and shows that transformers are capable of
generating high-quality synthetic data for both labels and covariates. We
further conduct extensive numerical experiments to demonstrate the efficacy of
our proposed approach compared with representative alternative solutions.
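
To make the idea of oversampling concrete, here is a minimal Python sketch of the generic procedure: every group is padded with synthetic rows until it matches the size of the largest group. This is not the OPAL implementation. The names `oversample`, `group_of`, `synthesize`, and `resample_with_replacement` are hypothetical, and the placeholder generator simply copies existing minority rows, whereas the approach described in the abstract would obtain new rows from an LLM.

```python
import random
from collections import Counter


def oversample(rows, group_of, synthesize, seed=0):
    """Pad every group with synthetic rows up to the size of the largest group.

    rows       -- list of records (here: dicts)
    group_of   -- function mapping a record to its group key
    synthesize -- function (members, rng) -> one new record for that group
    """
    rng = random.Random(seed)

    # Partition the records by group key.
    groups = {}
    for row in rows:
        groups.setdefault(group_of(row), []).append(row)

    # The largest group sets the target size for all groups.
    target = max(len(members) for members in groups.values())

    # Keep the original records and append synthetic ones for smaller groups.
    balanced = list(rows)
    for members in groups.values():
        for _ in range(target - len(members)):
            balanced.append(synthesize(members, rng))
    return balanced


def resample_with_replacement(members, rng):
    # Placeholder generator (an assumption for illustration only):
    # copy a randomly chosen existing record from the same group.
    return dict(rng.choice(members))


# Toy data: 8 majority records (label 0) and 2 minority records (label 1).
data = [{"x": i, "label": 0} for i in range(8)]
data += [{"x": 100 + i, "label": 1} for i in range(2)]

balanced = oversample(data, lambda r: r["label"], resample_with_replacement)
counts = Counter(r["label"] for r in balanced)
```

In this toy run both labels end up with eight records each. To target spurious correlations rather than plain class imbalance, `group_of` could return a pair such as the label together with the spuriously correlated attribute, so that under-represented combinations are the ones that get padded.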