OpusTools and Parallel Corpus Diagnostics

Mikko Aulamo, Umut Sulubacak, Sami Virpioja, Jörg Tiedemann

LREC (2020)

Abstract
This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it easy to convert between common formats. OpusTools also includes tools for language identification and data filtering, as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for identifying potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.
Keywords
Corpus (Creation, Annotation, etc.); Machine Translation; Tools, Systems, Applications