Developing a Greek Biomedical Corpus towards Text Mining


The IATROLEXI project aims to create critical infrastructure for the Greek language that will form the groundwork for advanced NLP applications in the domain of biomedicine, such as text indexing, information extraction and retrieval, text mining, and question answering systems. To accomplish this, a number of essential tools and resources will be constructed for the Greek language, allowing better management and processing of information in the biomedical field. This will be made possible through the compilation of a representative corpus of biomedical texts and the construction of NLP tools for the structural, lexical and semantic annotation of those texts. In this paper, we present the design and compilation of the Greek biomedical corpus. The text collection criteria were originally imposed by the project requirements: the corpus should comprise written texts only. Due to time constraints, downloading texts from websites proved to be the only viable and certainly the least time-consuming solution. Overall, forty Greek websites were identified as containing medical documents appropriate for IATROLEXI. Most of the documents are paper abstracts, full papers, and conference proceedings. Apart from the body text, the majority of them contained additional material such as images, tables, and graphical representations. Approximately 6,250 documents have been collected so far, of which 69.8 percent are in HyperText Markup Language (.html) and the remaining 30.2 percent are in Portable Document Format (.pdf).
NLP, corpus linguistics, biomedical terminology, biomedical corpus, markup language, text mining, information extraction, critical infrastructure, question answering system