The Impact of Training Data Quality on Automated Content Scoring Performance

Semantic Scholar (2020)

Abstract

With the advent of advanced natural language processing methods and their application to evaluating constructed responses, automated scoring of content has rapidly become a potential alternative to human rating of constructed responses to prompts that measure specific content knowledge. In this paper, we conduct experiments using scored responses to almost 400 prompts collected from four different assessments in order to better understand how training data quality, particularly in terms of training sample size, Human-Human agreement (H-H agreement, i.e., the correlation between two independent human scores for the same prompt), and average response length, relates to system performance as measured by Quadratic Weighted Kappa (QWK) between human ratings and machine predictions. Not surprisingly, we find that H-H agreement has a substantial impact on system performance, though regardless of H-H agreement, increasing the training sample size improves the accuracy of the predictions. Our results can potentially provide additional helpful guidelines to researchers and practitioners about the factors that most influence the performance of automated content-scoring systems.
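The evaluation metric used throughout, Quadratic Weighted Kappa, can be computed directly from a confusion matrix of human and machine scores. The following is a minimal pure-Python sketch (not the authors' code); the function name and signature are illustrative, and it assumes integer rating scales as is typical in content scoring:

```python
def quadratic_weighted_kappa(human, machine, min_rating=None, max_rating=None):
    """Quadratic Weighted Kappa between two integer rating sequences.

    Returns 1.0 for perfect agreement, 0.0 for chance-level agreement.
    """
    if min_rating is None:
        min_rating = min(min(human), min(machine))
    if max_rating is None:
        max_rating = max(max(human), max(machine))
    n = max_rating - min_rating + 1  # number of distinct rating levels

    # Observed confusion matrix O[i][j]: human gave i, machine gave j
    O = [[0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        O[h - min_rating][m - min_rating] += 1

    num_items = len(human)
    hist_human = [sum(row) for row in O]                       # human marginals
    hist_machine = [sum(O[i][j] for i in range(n)) for j in range(n)]

    numer = denom = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic disagreement weight, normalized to [0, 1]
            w = ((i - j) ** 2) / ((n - 1) ** 2) if n > 1 else 0.0
            expected = hist_human[i] * hist_machine[j] / num_items
            numer += w * O[i][j]
            denom += w * expected
    return 1.0 - numer / denom
```

For example, identical human and machine scores yield a QWK of 1.0, while a single one-point disagreement on a three-point scale lowers it proportionally to the squared distance between the ratings.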