
ABG Corpus

Aline De Lima Benevides, Bruno Ferrari Guide

TEXTO LIVRE-LINGUAGEM E TECNOLOGIA (2017)

Abstract
This paper describes the compilation of the ABG Corpus, a linguistic corpus of Brazilian Portuguese assembled by the authors, together with the computational tools developed for the task. Our main goal was to gather a large body of texts, from both spoken and written language, that represents Brazilian Portuguese as faithfully as possible and can serve as a database for our research (Guide, 2016; Benevides, 2017). The ABG Corpus contains 3,616,625 word tokens and 92,602 word types; 1,938,805 of the tokens come from spoken-language corpora and 1,676,820 from written corpora. Working within the corpus linguistics framework and using computational tools developed in Python, this article presents and provides access to the ABG Corpus, the computational tools (a stress marker, a phonological structure identifier, and a syllabifier), and some stress- and syllable-related phonological information already present in the corpus. We close by inviting the community to build on our findings and explore this new tool.
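The distinction between word tokens (every running word) and word types (distinct word forms) can be illustrated with a minimal Python sketch. This is not the authors' tooling: the function name `count_tokens_and_types`, the regex-based tokenizer, and the lowercasing step are all assumptions made for illustration, and the ABG Corpus counts may rest on different tokenization rules.

```python
import re
from typing import Tuple


def count_tokens_and_types(text: str) -> Tuple[int, int]:
    """Return (number of tokens, number of types) for a text.

    Assumption: a token is a run of letters (accented letters included),
    optionally joined by internal hyphens, compared case-insensitively.
    """
    # [^\W\d_] matches any Unicode letter: a word character that is
    # neither a digit nor an underscore.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    # Types are the distinct token forms.
    return len(tokens), len(set(tokens))


# Hypothetical sample sentence, not taken from the corpus.
sample = "A casa é bonita e a casa é grande."
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 9 tokens, 6 types
```

Note that "é" and "e" count as separate types here because accents are preserved, which matters for a language where diacritics distinguish words.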
Keywords
linguistic corpus,computational linguistics,Brazilian Portuguese