Electronic Theses and Dissertations

Date of Award

1-1-2023

Document Type

Thesis

Degree Name

M.S. in Engineering Science

First Advisor

Thai Le

Second Advisor

Yixin Chen

Third Advisor

Byunghyun Jang

Relational Format

dissertation/thesis

Abstract

Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPT) and Bidirectional Encoder Representations from Transformers (BERT), have significantly advanced Natural Language Processing (NLP), achieving state-of-the-art results across a range of tasks. Deployed in production systems such as Google Search, these models can also be adapted to a particular field through domain-specific pre-training. In Earth Science, where massive volumes of data are generated and made publicly available by institutions such as the National Aeronautics and Space Administration (NASA), understanding how such datasets are used in the scientific literature is critical to assessing their scientific impact.

This thesis introduces EarthSciBERT, a domain-specific BERT model for Earth Science, built by pre-training BERT on Earth Science literature abstracts and then fine-tuning it for dataset retrieval and ranking in publications. EarthSciBERT was compared against several baselines: the standard BERT-Base model, a BERT model pre-trained from scratch, and a BERT model continually pre-trained from BERT-Base, each evaluated both with and without the added fine-tuning step. The models' effectiveness at information retrieval, such as retrieving or suggesting appropriate datasets for Earth Science research, was evaluated using metrics including Precision at k, Recall at k, and Mean Average Precision (MAP). The findings indicate that EarthSciBERT outperforms the original BERT model and the other variants across these metrics, suggesting that domain-adapted BERT models hold promise for specialized information retrieval tasks.
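The ranking metrics named above have standard definitions. As a minimal illustrative sketch (not the thesis's evaluation code; the document IDs and queries below are hypothetical), they can be computed from a ranked result list and a set of relevant items as follows:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items recovered within the top k."""
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(ranked_lists, relevant_sets):
    """MAP: average_precision averaged over all queries."""
    return sum(average_precision(r, rel)
               for r, rel in zip(ranked_lists, relevant_sets)) / len(ranked_lists)

# Hypothetical example: a model ranks four candidate datasets for one query,
# of which "d1" and "d2" are the truly relevant ones.
ranked = ["d1", "d3", "d2", "d5"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(recall_at_k(ranked, relevant, 2))      # 0.5
print(average_precision(ranked, relevant))   # (1/1 + 2/3) / 2 ≈ 0.8333
```

MAP then aggregates average precision over the full query set, rewarding models that place relevant datasets near the top of every ranking rather than just on average.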

This study offers a novel method for applying machine learning models to information retrieval tasks, namely retrieving and ranking datasets in Earth Science, and lays the foundation for similar advances in other scientific domains, such as medicine and biology. It also contributes to Artificial Intelligence (AI) by highlighting the significant contributions that such domain-specific language models can make.
