Open Generative Large Language Models for Galician
CoRR (2024)
Abstract
Large language models (LLMs) have transformed natural language processing.
Yet, their predominantly English-centric training has led to biases and
performance disparities across languages. This imbalance marginalizes
minoritized languages, making equitable access to NLP technologies more
difficult for lower-resource languages such as Galician. To bridge this gap, we
present the first two generative LLMs focused on Galician. These models,
freely available as open-source resources, were trained using a GPT
architecture with 1.3B parameters on a corpus of 2.1B words. Using continual
pretraining, we adapt two existing LLMs, themselves trained on larger corpora,
to Galician, thus mitigating the data constraints that training from scratch
would impose. The models were evaluated using human judgments
and task-based datasets from standardized benchmarks. These evaluations reveal
promising performance, underscoring the importance of linguistic diversity in
generative models.