Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
CoRR (2024)
Abstract
Current foundation models have shown impressive performance across various
tasks. However, several studies have revealed that these models are not
effective for everyone due to the imbalanced geographical and economic
representation of the data used in the training process. Most of this data
comes from Western countries, leading to poor results for underrepresented
countries. To address this issue, more data needs to be collected from these
countries, but the cost of annotation can be a significant bottleneck. In this
paper, we propose methods for identifying which data to annotate so as to
balance model performance against annotation cost. Our approach first finds the
countries with images of topics (objects and actions) most visually distinct
from those already in the training datasets used by current large
vision-language foundation models. Next, we identify countries with higher
visual similarity for these topics and show that using data from these
countries to supplement the training data improves model performance and
reduces annotation costs. The resulting lists of countries and corresponding
topics are made available at
https://github.com/MichiganNLP/visual_diversity_budget.
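
The ranking step described above can be pictured with a minimal sketch for a single topic. It assumes every image has already been converted to an embedding vector by some vision-language encoder; the function names, the mean pooling of embeddings, and the use of cosine similarity are illustrative assumptions, not the procedure the paper itself specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_countries_by_similarity(reference_vectors, country_vectors):
    """Rank countries from least to most similar to the reference data.

    reference_vectors: embeddings of one topic's images in the existing
        training data (hypothetical input).
    country_vectors: dict mapping a country name to the embeddings of the
        same topic's images from that country (hypothetical input).
    """
    reference_mean = mean_vector(reference_vectors)
    scores = {
        country: cosine_similarity(mean_vector(vectors), reference_mean)
        for country, vectors in country_vectors.items()
    }
    # Ascending order: the most visually distinct country comes first.
    return sorted(scores.items(), key=lambda item: item[1])


# Toy 2-D embeddings, invented purely for illustration.
reference = [[1.0, 0.0], [0.9, 0.1]]
by_country = {
    "country_a": [[1.0, 0.1], [0.8, 0.0]],
    "country_b": [[0.0, 1.0], [0.1, 0.9]],
}
ranking = rank_countries_by_similarity(reference, by_country)
```

In this toy setup `country_b` is ranked first because its images are the least similar to the reference data, which would make it the stronger candidate for new annotation. The same similarity function could be applied between pairs of countries to look for visually similar countries whose data might supplement the training set.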