Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek
CoRR (2024)
Abstract
This article presents beta version 0.1.0 of Opera Graeca Adnotata (OGA), the
largest open-access multilayer corpus for Ancient Greek (AG). OGA
consists of 1,687 literary works and 34M+ tokens coming from the PerseusDL and
OpenGreekAndLatin GitHub repositories, which host AG texts ranging from about
800 BCE to about 250 CE. The texts have been enriched with seven annotation
layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii)
lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi)
dependency function layer; (vii) Canonical Text Services (CTS) citation layer.
The creation of each layer is described by highlighting the main technical and
annotation-related issues encountered. Tokenization, sentence segmentation, and
CTS citation are performed by rule-based algorithms, while morphosyntactic
annotation is the output of the COMBO parser trained on the data of the Ancient
Greek Dependency Treebank. For the sake of scalability and reusability, the
corpus is released in the standoff formats PAULA XML and its offspring LAULA
XML.
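
As a rough illustration of the standoff principle mentioned above, the following Python sketch keeps the base text untouched and stores each annotation layer separately, pointing back to the text by character offsets or token indices. It is not the PAULA XML or LAULA XML schema and not the OGA pipeline: the `Span` class, the `tokenize` and `surface` helpers, the punctuation set, and the sample string are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A token as a pair of character offsets into the base text."""
    start: int  # inclusive
    end: int    # exclusive


# Hypothetical rule: a token is either a run of non-space, non-punctuation
# characters or a single punctuation mark. The real OGA rules are not shown
# in the abstract and are certainly more elaborate.
TOKEN_PATTERN = re.compile(r"[^\s.,;·]+|[.,;·]")


def tokenize(text: str) -> list[Span]:
    """Return token offsets only; the base text itself is never modified."""
    return [Span(m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


def surface(text: str, span: Span) -> str:
    """Recover the surface form of a token from the base text."""
    return text[span.start:span.end]


# Illustrative base text (not taken from the corpus files).
base_text = "μῆνιν ἄειδε, θεά"

# Layer 1: tokenization, stored as offsets into base_text.
tokens = tokenize(base_text)

# Layer 2: lemmatization, stored separately and keyed by token index.
lemmas = {0: "μῆνις", 1: "ἀείδω", 3: "θεά"}

for index, span in enumerate(tokens):
    print(index, surface(base_text, span), lemmas.get(index, "-"))
```

Because each layer only references the base text, a new layer (for example, dependency heads keyed by token index) can be added or replaced without rewriting the layers that already exist, which is the scalability and reusability argument made for standoff formats.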