Language and Speech Technology for Central Kurdish Varieties
arXiv (2024)
Abstract
Kurdish, an Indo-European language spoken by over 30 million people, is
considered a dialect continuum and known for its diversity in language
varieties. Previous studies addressing language and speech technology for
Kurdish handle it in a monolithic way as a macro-language, resulting in
disparities for dialects and varieties for which there are few resources and
tools available. In this paper, we take a step towards developing resources for
language and speech technology for varieties of Central Kurdish, creating a
corpus by transcribing movies and TV series as an alternative to fieldwork.
Additionally, we report the performance of machine translation, automatic
speech recognition, and language identification as downstream tasks evaluated
on Central Kurdish varieties. Data and models are publicly available under an
open license at https://github.com/sinaahmadi/CORDI.