Date of Award
1-1-2023
Document Type
Thesis
Degree Name
M.S. in Engineering Science
First Advisor
Thai Le
Second Advisor
Yixin Chen
Third Advisor
Byunghyun Jang
Relational Format
dissertation/thesis
Abstract
Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPT) and Bidirectional Encoder Representations from Transformers (BERT), have significantly advanced Natural Language Processing (NLP), achieving state-of-the-art results across a range of tasks. Notably deployed in systems such as Google Search, these models can also be adapted to a particular field through domain-specific pre-training. In Earth Science, where institutions such as the National Aeronautics and Space Administration (NASA) generate and publicly release massive volumes of data, understanding how these datasets are used in the scientific literature is critical to assessing their scientific impact.
This thesis introduces EarthSciBERT, a domain-specific BERT model created by pre-training BERT on Earth Science literature abstracts and then fine-tuning it for dataset retrieval and ranking in publications. EarthSciBERT was compared against several BERT baselines, including the standard BERT-Base, a BERT model pre-trained from scratch, and a BERT continually pre-trained from BERT-Base, with each of these models also evaluated with the added fine-tuning step. The effectiveness of these models on information retrieval tasks, such as retrieving or suggesting appropriate datasets for Earth Science research, was measured using metrics including Precision at k, Recall at k, and Mean Average Precision (MAP). The findings indicate that EarthSciBERT outperforms the original BERT model and the other variants across these metrics, suggesting that domain-adapted BERT models hold promise for specialized information retrieval tasks.
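The ranking metrics named above can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard definitions of Precision at k, Recall at k, and MAP, not code from the thesis; the function names and the toy dataset identifiers are invented for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Toy example: a model ranks five candidate datasets for one query,
# of which two ("ds1", "ds2") are actually relevant.
ranked = ["ds3", "ds1", "ds7", "ds2", "ds5"]
relevant = {"ds1", "ds2"}
print(precision_at_k(ranked, relevant, 3))  # 1 relevant item in the top 3
print(recall_at_k(ranked, relevant, 3))     # 1 of 2 relevant items found
print(average_precision(ranked, relevant))
```

In a dataset-retrieval setting like the one evaluated here, `ranked` would be the model's ordered list of candidate datasets for a query publication, and `relevant` the set of datasets the publication actually used.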
This study offers a novel method for applying machine learning models to information retrieval tasks, retrieving and ranking datasets in Earth Science, and it lays the foundation for similar advances in other scientific domains such as medicine and biology. It also contributes to Artificial Intelligence (AI) by highlighting the significant gains domain-specific language models can deliver on specialized tasks.
Recommended Citation
Shrestha, Rishabh, "EarthSciBERT: Pre-trained Language Model for Information Retrieval in Earth Science" (2023). Electronic Theses and Dissertations. 2772.
https://egrove.olemiss.edu/etd/2772