Geoscience Knowledge Understanding and Utilization via Data-centric Large Language Model

Crossref (2024)

Abstract
Large language models (LLMs) have made substantial progress in general natural language processing domains. GeoLM represents a significant stride in adapting LLMs to geoscience, with the goal of enhancing research and practical applications in this specialized area. We have developed two distinct models: K2, a 7-billion-parameter LLM trained on a 5.5-billion-token geoscience text corpus that includes over 1 million pieces of geoscience literature, and GeoGalactica, a 30-billion-parameter LLM trained on an extensive 65-billion-token geoscience-related corpus. Supported by the Deep-time Digital Earth (DDE) project, we curate and maintain the largest text corpus specifically designed for geoscience. The efficacy of LLMs in the geoscience domain is fundamentally linked to access to, and a deep understanding of, extensive geoscience data; in this respect, data-centric AI is crucial. We put forward a framework, GeoLM, to tackle the challenges of data science within the geosciences, integrating techniques such as information extraction, data integration, and data mining. The GeoLM framework is dedicated to constructing and applying data-centric geoscience LLMs, with the aim of enabling the wider scientific community to harness these advanced models for a deeper understanding and more effective application of geoscience knowledge.
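To make the training setup described in the abstract more concrete, the sketch below shows domain-adaptive continued pretraining of a 7-billion-parameter causal LLM on a geoscience text corpus. It is a minimal illustration only: the base checkpoint, corpus path, and hyperparameters are assumptions for exposition and are not the paper's actual K2 or GeoGalactica configuration.

```python
# Minimal sketch: continued pretraining of a 7B causal LLM on a geoscience corpus.
# Base model, corpus path, and hyperparameters are illustrative assumptions,
# not the configuration reported for K2 or GeoGalactica.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "huggyllama/llama-7b"          # assumed 7B base checkpoint
CORPUS_FILES = "geoscience_corpus/*.jsonl"  # hypothetical path to geoscience text

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Load the raw geoscience text and tokenize it for causal language modeling.
raw = load_dataset("json", data_files=CORPUS_FILES, split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Standard next-token-prediction objective (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="geoscience-cpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

In practice, a run at this scale would also involve distributed training, careful data cleaning and deduplication, and downstream supervised fine-tuning, which the sketch omits.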