Training BERT Models to Carry Over a Coding System Developed on One Corpus to Another
International Conference on Computational Linguistics(2023)
摘要
This paper describes how we train BERT models to carry over a coding system
developed on the paragraphs of a Hungarian literary journal to another. The aim
of the coding system is to track trends in the perception of literary
translation around the political transformation in 1989 in Hungary. To evaluate
not only task performance but also the consistence of the annotation, moreover,
to get better predictions from an ensemble, we use 10-fold crossvalidation.
Extensive hyperparameter tuning is used to obtain the best possible results and
fair comparisons. To handle label imbalance, we use loss functions and metrics
robust to it. Evaluation of the effect of domain shift is carried out by
sampling a test set from the target domain. We establish the sample size by
estimating the bootstrapped confidence interval via simulations. This way, we
show that our models can carry over one annotation system to the target domain.
Comparisons are drawn to provide insights such as learning multilabel
correlations and confidence penalty improve resistance to domain shift, and
domain adaptation on OCR-ed text on another domain improves performance almost
to the same extent as that on the corpus under study. See our code at
https://codeberg.org/zsamboki/bert-annotator-ensemble.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要