ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining.

International Conference on Language Resources and Evaluation (LREC), 2022

Abstract
Pre-trained language models have become crucial to achieving competitive results across many Natural Language Processing (NLP) problems. The number of monolingual pre-trained models for low-resource languages has increased significantly, but most of them target the general domain, and strong baseline language models for specific domains remain scarce. We introduce ViHealthBERT, the first domain-specific pre-trained language model for Vietnamese healthcare. Our model achieves strong results, outperforming general-domain language models on all health-related datasets. Moreover, we present Vietnamese healthcare-domain datasets for two tasks: Acronym Disambiguation (AD) and Frequently Asked Questions (FAQ) Summarization. We release ViHealthBERT to facilitate future research and downstream applications for domain-specific Vietnamese NLP. Our dataset and code are available at https://github.com/demdecuong/vihealthbert.
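As a rough illustration of how such a released checkpoint is typically consumed downstream, the sketch below loads the model with Hugging Face Transformers and extracts contextual embeddings. The checkpoint name "demdecuong/vihealthbert-base-word" is an assumption inferred from the authors' GitHub account, and the example Vietnamese sentence is hypothetical; consult the repository for the published identifier and preprocessing requirements.

    # Minimal sketch: loading ViHealthBERT via Hugging Face Transformers.
    # Checkpoint name is an assumption; see the authors' repository.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "demdecuong/vihealthbert-base-word"  # assumed identifier
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    # Vietnamese health-domain sentence; word-segmented input (syllables
    # joined by underscores) is the usual convention for Vietnamese
    # word-level models.
    text = "Benh_nhan bi sot cao va ho keo_dai ."

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state holds one contextual vector per token, usable as
    # features for downstream tasks such as NER or FAQ summarization.
    print(outputs.last_hidden_state.shape)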
Keywords
Low-resource language, language model, healthcare, acronym disambiguation, summarization