Date of Award
M.S. in Engineering Science
Computer and Information Science
Current classification techniques have limitations when applied to gene expression microarray data. For example, the robustness of top scoring pairs does not extend to some small datasets, and the gene set with the best discrimination power may not involve a combination of genes. Hence, it is necessary to construct a discriminative and stable classifier that generates highly informative gene sets. Not all features are active in a given biological process, so a good feature selector should be robust to noise and outliers; the challenge is to select the most informative genes. Motivated by this issue, the top discriminating pair (TDP) approach in this study aims to reveal which features rank highest by discrimination power. To identify TDPs, each pair of genes is assigned a score based on their relative probability distribution. Our experiment combines the TDP methodology with information gain (IG) to obtain an effective feature set. To illustrate the effectiveness of TDP with IG, we applied the method to two breast cancer datasets (Wang et al., 2005 and Van't Veer et al., 2002). On these datasets, the results of the TDP method are competitive with those of a random forest baseline. Combining information gain with the TDP algorithm provides a new, effective method for feature selection in machine learning.
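As a rough illustration of scoring gene pairs with information gain, the Python sketch below ranks pairs by how much the indicator "gene i is expressed below gene j" reduces class-label entropy. This is only one plausible reading: the abstract does not give the thesis's scoring formula, so the indicator feature, the function names (`entropy`, `information_gain`, `rank_pairs`), and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


def rank_pairs(samples, labels, top_k=3):
    """Rank gene pairs (i, j) by the information gain of the indicator x[i] < x[j].

    Assumption: this indicator stands in for the pair score; the thesis's
    actual score may be defined differently.
    """
    n_genes = len(samples[0])
    scored = []
    for i, j in combinations(range(n_genes), 2):
        indicator = [row[i] < row[j] for row in samples]
        scored.append((information_gain(indicator, labels), (i, j)))
    # Highest information gain first; ties keep their enumeration order.
    scored.sort(key=lambda item: -item[0])
    return scored[:top_k]


# Toy data, made up for illustration: 4 samples x 3 genes, two classes.
samples = [
    [1.0, 2.0, 5.0],
    [1.5, 3.0, 4.0],
    [4.0, 1.0, 5.0],
    [3.5, 0.5, 4.5],
]
labels = ["A", "A", "B", "B"]
top = rank_pairs(samples, labels)
```

In this toy data the ordering of genes 0 and 1 flips between the two classes, so that pair separates them perfectly, while pairs whose ordering never changes carry no information about the label.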
Gui, Tian, "A Pairwise Feature Selection Method For Gene Data Using Information Gain" (2014). Electronic Theses and Dissertations. 943.
Emphasis: Computer Science