Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
CoRR (2024)
Abstract
Synthetic data generation has the potential to impact applications and
domains with scarce data. However, before such data is used for sensitive tasks
such as mental health, we need an understanding of how different demographics
are represented in it. In our paper, we analyze the potential of producing
synthetic data using GPT-3 by exploring the various stressors it attributes to
different race and gender combinations, to provide insight for future
researchers looking into using LLMs for data generation. Using GPT-3, we
develop HEADROOM, a synthetic dataset of 3,120 posts about
depression-triggering stressors, by controlling for race, gender, and time
frame (before and after COVID-19). Using this dataset, we conduct semantic and
lexical analyses to (1) identify the predominant stressors for each demographic
group; and (2) compare our synthetic data to a human-generated dataset. We
present the procedures to generate queries to develop depression data using
GPT-3, and conduct analyses to uncover the types of stressors it assigns to
demographic groups, which could be used to test the limitations of LLMs for
synthetic data generation for depression data. Our findings show that synthetic
data mimics some of the human-generated data distribution for the predominant
depression stressors across diverse demographics.