Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases
CoRR (2024)
Abstract
Speech-to-Text Translation (S2TT) has typically been addressed with cascade
systems, where speech recognition systems generate a transcription that is
subsequently passed to a translation model. While there has been a growing
interest in developing direct speech translation systems to avoid propagating
errors and losing non-verbal content, prior work in direct S2TT has struggled
to conclusively establish the advantages of integrating the acoustic signal
directly into the translation process. This work proposes using contrastive
evaluation to quantitatively measure the ability of direct S2TT systems to
disambiguate utterances where prosody plays a crucial role. Specifically, we
evaluate Korean-English translation systems on a test set containing
wh-phrases, for which prosodic features are necessary to produce translations
with the correct intent, such as a statement, a yes/no question, or a
wh-question. Our results clearly demonstrate the value of direct
translation systems over cascade translation models, with a notable 12.9%
improvement in overall accuracy in ambiguous cases, along with up to a 15.6%
increase in F1 scores for one of the major intent categories. To the best of
our knowledge, this work stands as the first to provide quantitative evidence
that direct S2TT models can effectively leverage prosody. The code for our
evaluation is openly available for review and use.
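
To make the evaluation idea concrete, the following is a minimal sketch of one common form of contrastive evaluation: for each utterance, the system scores several candidate translations that differ only in intent, and the highest-scoring candidate is taken as its prediction. The abstract does not describe the exact protocol, so everything here is an assumption for illustration: the `Example` record layout, the `score` callable (a stand-in for a model's log-likelihood of a translation given the audio), the toy score table, and the English candidate sentences are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record: (utterance id, gold intent, {intent: candidate translation}).
Example = Tuple[str, str, Dict[str, str]]


def contrastive_predict(score: Callable[[str, str], float], example: Example) -> str:
    """Return the intent whose candidate translation receives the highest score."""
    utt_id, _gold, candidates = example
    return max(candidates, key=lambda intent: score(utt_id, candidates[intent]))


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of utterances whose predicted intent matches the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Per-intent F1, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy data (invented for illustration; not from the paper's test set).
examples: List[Example] = [
    ("u1", "wh-question", {
        "statement": "Someone came.",
        "yes/no question": "Did someone come?",
        "wh-question": "Who came?",
    }),
    ("u2", "yes/no question", {
        "statement": "You ate something.",
        "yes/no question": "Did you eat something?",
        "wh-question": "What did you eat?",
    }),
]

# Stand-in for model log-likelihoods; a real system would score from audio.
toy_scores = {
    ("u1", "Someone came."): -4.0,
    ("u1", "Did someone come?"): -3.0,
    ("u1", "Who came?"): -1.0,
    ("u2", "You ate something."): -5.0,
    ("u2", "Did you eat something?"): -2.5,
    ("u2", "What did you eat?"): -2.0,
}


def toy_score(utt_id: str, translation: str) -> float:
    return toy_scores[(utt_id, translation)]


gold = [ex[1] for ex in examples]
pred = [contrastive_predict(toy_score, ex) for ex in examples]
print(accuracy(gold, pred), per_class_f1(gold, pred))
```

In this sketch, swapping `toy_score` for a cascade system's scorer and a direct system's scorer would allow the two to be compared on the same candidates, which is the kind of comparison the abstract reports.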