Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal
arXiv (2024)
Abstract
This work is part of the Kallaama project, whose objective is to produce and
disseminate national-language corpora for the development of speech technologies in
the field of agriculture. Except for Wolof, which benefits from some language
data for natural language processing, national languages of Senegal are largely
ignored by language technology providers. However, such technologies are key
to the protection, promotion and teaching of these languages. Kallaama focuses
on the three languages most widely spoken by Senegalese people: Wolof, Pulaar and Sereer.
These languages are widely spoken by the population, with around 10 million
native Senegalese speakers, not to mention those outside the country. However,
they remain under-resourced in terms of machine-readable data that can be used
for automatic processing and language technologies, all the more so in the
agricultural sector. We release a transcribed speech dataset containing 125
hours of recordings, about agriculture, in each of the above-mentioned
languages. These resources are specifically designed for Automatic Speech
Recognition purposes, including traditional approaches. To build such
technologies, we provide textual corpora in Wolof and Pulaar, and a
pronunciation lexicon containing 49,132 entries from the Wolof dataset.