Date of Award
1-1-2023
Document Type
Thesis
Degree Name
M.S. in Engineering Science
First Advisor
Thai Le
Second Advisor
Yixin Chen
Third Advisor
Byunghyun Jang
Relational Format
dissertation/thesis
Abstract
Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPT) and Bidirectional Encoder Representations from Transformers (BERT), have significantly advanced Natural Language Processing (NLP), achieving state-of-the-art results across a range of tasks. Notably deployed in systems such as Google Search, these models can also be adapted to a particular field through domain-specific pre-training. In Earth Science, where institutions such as the National Aeronautics and Space Administration (NASA) generate and publicly release massive volumes of data, understanding how these datasets are used in the scientific literature is critical to assessing their scientific impact.
This thesis introduces EarthSciBERT, a domain-specific BERT model created by pre-training BERT on Earth Science literature abstracts and then fine-tuning it for dataset retrieval and ranking in publications. EarthSciBERT was compared against several BERT baselines, including the standard BERT-Base, a BERT model pre-trained from scratch, and a BERT continually pre-trained from BERT-Base, with each of these models also evaluated with the added fine-tuning step. The effectiveness of these models on information retrieval tasks, such as retrieving or suggesting appropriate datasets for Earth Science research, was measured using metrics including Precision at k, Recall at k, and Mean Average Precision (MAP). The findings indicate that EarthSciBERT outperforms the original BERT model and the other variants across these metrics, suggesting that domain-adapted BERT models hold promise for specialized information retrieval tasks.
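The ranking metrics named above can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard definitions of Precision at k, Recall at k, and MAP, not code from the thesis; the function names and the toy dataset identifiers are invented for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Toy example: a model ranks five candidate datasets for one query,
# of which two ("ds1", "ds2") are actually relevant.
ranked = ["ds3", "ds1", "ds7", "ds2", "ds5"]
relevant = {"ds1", "ds2"}
print(precision_at_k(ranked, relevant, 3))  # 1 relevant item in the top 3
print(recall_at_k(ranked, relevant, 3))     # 1 of 2 relevant items found
print(average_precision(ranked, relevant))
```

In a dataset-retrieval setting like the one evaluated here, `ranked` would be the model's ordered list of candidate datasets for a query publication, and `relevant` the set of datasets the publication actually used.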
This study offers a novel method for applying machine learning models to information retrieval tasks, retrieving and ranking datasets in Earth Science, and it lays the foundation for similar advances in other scientific domains such as medicine and biology. It also contributes to Artificial Intelligence (AI) by highlighting the significant gains domain-specific language models can deliver on specialized tasks.
Recommended Citation
Shrestha, Rishabh, "EarthSciBERT: Pre-trained Language Model for Information Retrieval in Earth Science" (2023). Electronic Theses and Dissertations. 2772.
https://egrove.olemiss.edu/etd/2772