Corpus Building for Low Resource Languages in the DARPA LORELEI Program

Jennifer Tracey, Stephanie Strassel, Ann Bies, Zhiyi Song, Michael Arrigo, Kira Griffitt, Dana Delgado, Dave Graff, Seth Kulick, Justin Mott, Neil Kuster

semanticscholar(2019)

Abstract
We describe corpora for the LORELEI (Low Resource Languages for Emergent Incidents) Program, whose goal is to build human language technologies to provide situational awareness during emergent incidents, with a particular focus on low resource languages. Incident Language packs are used for system development and testing in machine translation, entity disambiguation and linking, and the “situation frame” task, which requires aggregation of information about the emergent incident. Incident languages, as well as the incidents themselves, remain unknown until the evaluation begins, and no labeled training data is provided; system developers must rapidly adapt technology for the incident language and return initial results within 24 hours. Given this surprise language evaluation scenario, Representative Language packs are designed to support research into cross-language projection and language universals rather than to provide training data. They contain large volumes of monolingual and parallel text, basic annotations, lexical resources and simple NLP tools for 23 languages selected for typological diversity and coverage. We discuss the creation of the LORELEI language packs with a special focus on resources for machine translation, as well as techniques for maintaining consistency across the language packs. © 2019 The authors. This article is licensed under a Creative Commons 4.0 license, no derivative works, attribution, CC BY-ND.