Language Model for Statistics Domain

Young-Seob Jeong, EunJin Kim, JunHa Hwang, Medard E. Mswahili, YoungJin Kim

2022 5th International Conference on Artificial Intelligence for Industries (AI4I)

Abstract
Since the transformer appeared, many studies have proposed variants of representative language models such as Bidirectional Encoder Representations from Transformers (BERT) [1] and the Generative Pre-Training (GPT) series [2]. Huge language models have appeared recently (e.g., Chinchilla [3], Megatron-LM), while domain-specific (or language-specific) language models have also been studied, for example BioBERT for bioinformatics [4], SwahBERT for the Swahili language [5], and FinBERT for the financial domain [6]. Statistics is undoubtedly a domain with a large amount of collected data (e.g., statistical reports). A language model pre-trained for the statistics domain would likely deliver substantial performance improvements in downstream tasks such as industry code classification and job code classification, and more accurate systems for these code classification tasks will contribute to better national statistics and taxation. Indeed, many countries are trying to develop such systems, and this paper summarizes relevant findings and provides suggestions for developing language models for the statistics domain.
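
The downstream use described above, fine-tuning a pre-trained language model for industry or job code classification, can be illustrated with a minimal sketch. The model name, number of code labels, and example texts below are assumptions for illustration, not details from the paper; a statistics-domain pre-trained model would replace the placeholder checkpoint.

    # Minimal sketch (assumptions, not the paper's method): fine-tuning a
    # pre-trained transformer for industry-code classification.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "bert-base-multilingual-cased"  # placeholder checkpoint
    NUM_CODES = 232                              # hypothetical number of industry codes

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_CODES
    )

    # Toy batch: free-text business descriptions paired with industry-code labels.
    texts = [
        "Retail sale of fresh vegetables at a street market",
        "Development of custom enterprise software",
    ]
    labels = torch.tensor([0, 1])  # dummy label ids

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)

    # One gradient step; in practice this would loop over a full labelled corpus.
    outputs.loss.backward()
    print(float(outputs.loss))
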
Keywords
Language model, Transformer, Statistics, Domain-specific language model