Bias And Statistical Significance In Evaluating Speech Synthesis With Mean Opinion Scores

Andrew Rosenberg, Bhuvana Ramabhadran

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION (2017)

Abstract

Listening tests and Mean Opinion Scores (MOS) are the most commonly used techniques for evaluating the quality and naturalness of synthesized speech. They are invaluable in assessing subjective qualities of machine-generated stimuli. However, there are a number of challenges in interpreting the MOS values that come out of listening tests. Primarily, we advocate for the use of non-parametric statistical tests when calculating statistical significance in comparisons of listening test results. Additionally, based on the results of 46 legacy listening tests, we measure the impact of two sources of bias. Bias introduced by individual participants and by the synthesized text can have a dramatic impact on observed MOS scores. For example, we find that on average the mean difference between the highest and lowest scoring rater is over 2 MOS points (on a 5-point scale). From this observation, we caution against using any statistical test without adjusting for this bias, and provide specific non-parametric recommendations.
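
The abstract does not spell out which non-parametric tests the authors recommend, so the following is only a minimal sketch of one generic option, not the paper's method. It assumes each rater scored both systems, pairs the scores within each rater so that a rater's personal offset (lenient or harsh) cancels out, and then runs an exact two-sided sign-flip permutation test on the per-rater differences. The function names, the data layout, and the ratings themselves are hypothetical and invented purely for illustration.

```python
from itertools import product
from statistics import mean


def per_rater_differences(ratings, system_a, system_b):
    """Mean score for system_a minus mean score for system_b, per rater.

    `ratings` maps rater id -> {system name -> list of 1-5 scores}.
    Differencing within a rater removes that rater's constant offset.
    """
    diffs = []
    for by_system in ratings.values():
        if system_a in by_system and system_b in by_system:
            diffs.append(mean(by_system[system_a]) - mean(by_system[system_b]))
    return diffs


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each difference is equally likely to be
    positive or negative, so all 2**n sign assignments are enumerated.
    Only practical for a small number of raters.
    """
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical ratings (made up for illustration): raters differ widely in
# how high they score overall, but each one prefers system "A" slightly.
example_ratings = {
    "lenient_1": {"A": [5, 5], "B": [5, 4]},
    "lenient_2": {"A": [5, 4], "B": [4, 4]},
    "neutral_1": {"A": [4, 4], "B": [3, 3]},
    "neutral_2": {"A": [4, 3], "B": [3, 3]},
    "harsh_1": {"A": [3, 2], "B": [2, 2]},
    "harsh_2": {"A": [2, 2], "B": [1, 1]},
}

example_diffs = per_rater_differences(example_ratings, "A", "B")
example_p = sign_flip_p_value(example_diffs)
print(example_diffs, example_p)
```

In this toy data the raw scores overlap heavily across raters because of rater offsets, yet every rater's paired difference favors system A. That consistency is what the paired test picks up, which illustrates why rater bias needs to be accounted for before testing significance.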

Keywords

speech synthesis, listening tests, mean opinion score
