In-depth evaluation of Romanian natural language processing pipelines

ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY(2021)

Abstract
With the increased size of Universal Dependencies treebanks, several basic language processing kits (BLARK) for multiple languages have appeared in recent years, reporting improved performance across languages. Nevertheless, published results for the Romanian language are not directly comparable, since different tools use different Universal Dependencies versions and different additional resources, such as pre-trained word embeddings. In this paper, we re-train several state-of-the-art tools for processing the Romanian language using a common methodology: training and evaluating on the same version of the RoRefTrees corpus, and using the same pre-trained word embeddings from the Representative Corpus of Contemporary Romanian Language (CoRoLa). Furthermore, we explore the capabilities of the trained models when faced with unseen text from a different domain. For this purpose, we further test the resulting models on the SiMoNERo corpus. We employ different metrics to assess performance on tokenization, sentence splitting, lemmatization, part-of-speech tagging, and dependency parsing.
Keywords
Natural language processing, BLARK, performance evaluation, Romanian text processing