Honors Theses
Date of Award
Spring 5-12-2023
Document Type
Undergraduate Thesis
Department
Computer and Information Science
First Advisor
Yixin Chen
Second Advisor
Feng Wang
Third Advisor
Thai Le
Relational Format
Dissertation/Thesis
Abstract
Research in biology often relies on machine learning models that are ultimately uninterpretable by the researcher. It would be helpful if researchers could leverage the same computing power while gaining specific insight into the model's decision-making, and thereby a deeper understanding of their domain. This paper seeks to select features and derive rules from a machine learning classification problem in biochemistry. The specific point of interest is five species of Glycyrrhiza, or licorice, and the ability to classify them using High-Performance Thin Layer Chromatography (HPTLC) images. These images were taken using HPTLC methods under varying conditions to provide eight unique views of each species. Each view contains 24 samples with varying counts of the individual species. Several techniques are applied for feature selection and rule extraction. The first two are based on recently pioneered methods presented as “Binary Encoding of Random Forests” and “Rule Extraction using Sparse Encoding” (Liu 2012). In addition, an independently developed technique called “Interval Extraction and Consolidation” was applied, which was conceived in response to the particular nature of the dataset. Altogether, these techniques, used in concert with standard machine learning models, were able to narrow a feature space from around one thousand candidates to only ten. These ten most critical features were then used to derive a set of rules for the classification of the five species of licorice. Regarding feature selection, the Binary Encoding of Random Forests reduced the feature space as well as, or much better than, standard model parameter optimization in almost all cases. Additionally, Interval Extraction and Consolidation excelled at further simplifying the reduced feature space, often by another factor of five to ten. The selected features were then used for relatively simple rule extraction using decision trees, allowing for a more interpretable model.
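As a rough illustration of the final step described above (reducing a feature space of about one thousand candidates to ten and then reading rules off a decision tree), the following is a minimal sketch using scikit-learn. It is not the thesis's method: plain random forest importance ranking stands in for Binary Encoding of Random Forests, the data is synthetic rather than HPTLC images, and the names `top10`, `feature_<i>`, and all parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the thesis dataset):
# 120 samples, 1000 candidate features, 5 classes.
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 120, 1000, 5
y = rng.integers(0, n_classes, size=n_samples)
X = rng.normal(size=(n_samples, n_features))
# Make the first ten features informative by shifting them with the class label.
X[:, :10] += y[:, None] * 2.0

# Stand-in feature selection: rank features by random forest importance
# and keep the ten highest-ranked ones.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]

# Fit a shallow decision tree on the reduced feature space only.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:, top10], y)

# Print the tree as nested if/else threshold rules.
names = [f"feature_{i}" for i in top10]
rules = export_text(tree, feature_names=names)
print(rules)
```

Each root-to-leaf path in the printed output reads as one classification rule: a conjunction of threshold tests on the selected features ending in a predicted class. Capping the depth keeps the rules short at some cost in accuracy.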
Recommended Citation
Kovachev, Bozidar-Brannan, "Exploration of Feature Selection Techniques in Machine Learning Models on HPTLC Images for Rule Extraction" (2023). Honors Theses. 2841.
https://egrove.olemiss.edu/hon_thesis/2841
Accessibility Status
Searchable text
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.