Date of Award
Computer and Information Science
In today's world, the amount of raw data archived across multiple distinct domains is growing at an exponential rate. "Data mining" is a continuously evolving family of processes by which individuals extract useful information from these data. Classification is one of these processes: the construction of descriptive models of varying types from labeled data objects, for the purpose of predicting the labels of objects whose labels are unknown. The construction of these models is often adversely affected by the presence of incorrect or outlier values within the data, a phenomenon known as noise. The original motivation of this research was to test the performance of the binary genetic algorithm, one of a multitude of algorithms used for model construction, on data containing varying percentages of noise. In the course of experimentation, however, several issues arose concerning the effectiveness of the binary genetic algorithm as a classifier. Specifically, the chosen method for encoding classification hypotheses demonstrated limited scalability. Furthermore, the chosen method for encoding continuous and nominally valued data attributes proved unreasonably strict, leading to poor performance, and further research should be undertaken to investigate a more reasonable encoding method. Nevertheless, the algorithm performed favorably on purely categorical data with a moderate number of small-domained dimensions. When varying percentages of noise were injected into these data, the algorithm exhibited a slow, steady decline in classification accuracy. These results lead to the conclusion that the binary genetic algorithm should not be discounted as a possible approach to data classification, especially for data sets with these characteristics, and that further research could reveal hypothesis encoding strategies that improve scalability.
Stine, Matthew E., "Performance of Genetic Algorithms for Data Classification" (2001). Honors Theses. 676.