Electronic Theses and Dissertations

Date of Award

1-1-2012

Document Type

Dissertation

Degree Name

Ph.D. in Mathematics

Department

Mathematics

First Advisor

Xin Dang

Second Advisor

Yixin Chen

Third Advisor

Ali Al-Sharadqah

Relational Format

dissertation/thesis

Abstract

Classical multivariate statistical inference methods, including multivariate analysis of variance, principal component analysis, factor analysis, and canonical correlation analysis, are based on the sample covariance matrix. These moment-based techniques are optimal (most efficient) under the normality assumption. They are, however, extremely sensitive to outlying observations, susceptible to small perturbations in the data, and inefficient for heavy-tailed distributions. A straightforward remedy is to replace the sample covariance matrix with a robust one. Visuri et al. (2000) proposed a technique for robust covariance matrix estimation based on different notions of multivariate sign and rank. Among them, the spatial rank based covariance matrix estimator that utilizes a robust scale estimator (MRCM) is especially appealing due to its high robustness, computational ease, and good efficiency. In this dissertation, we establish orthogonal equivariance of the estimator under any distribution and affine equivariance under elliptically symmetric distributions. Its major robustness properties are studied through breakdown point and influence function analysis. More specifically, the finite sample breakdown point is obtained, and its upper bound can be achieved by a proper choice of univariate robust scale estimator. The influence functions for the eigenvalues and eigenvectors of the estimator are derived and found to be bounded under some mild assumptions. Moreover, empirical comparisons with the popular robust MCD, M, and S estimators show that MRCM is competitive in both efficiency and robustness.

With rapid advances in information technology, data have become huge in size and complex in structure. A single elliptical distribution is no longer sufficient to model such data. This motivates a generalization of our notion of MRCM to mixture models. In this dissertation, we propose a robust Spatial-EM algorithm for estimating the parameters of a mixture model. Rather than using the sample covariance matrix in each M-step, Spatial-EM uses MRCM to enhance the stability and robustness of the estimation procedure. Analysis of the log-likelihood function shows that the proposed estimator is closely related to the maximum likelihood estimator (MLE) of a Kotz-type mixture model. Compared with the direct MLE, Spatial-EM has advantages in computational ease as well as stability.

Applications of Spatial-EM to data mining follow naturally. We illustrate how to use Spatial-EM for supervised and unsupervised learning problems. More specifically, we propose robust clustering and outlier detection methods based on Spatial-EM. We apply the outlier detection method to taxonomic research on fish species novelty discovery. The UCI Wisconsin diagnostic breast cancer data and yeast cell cycle data are used for clustering analysis. Compared with regular EM and many other existing methods such as X-EM and SVM, Spatial-EM demonstrates competitive classification power and high robustness.
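
As an illustration only, the MRCM construction summarized in the abstract might be sketched as follows. This is a minimal sketch based on the general recipe of Visuri et al. (2000), not the dissertation's own code: the function names `spatial_ranks` and `mrcm` are hypothetical, and the median absolute deviation (MAD) is an assumed choice for the univariate robust scale estimator.

```python
import numpy as np

def spatial_ranks(X):
    """Spatial rank of each row: the average unit vector pointing
    from every other observation toward it."""
    diffs = X[:, None, :] - X[None, :, :]      # (n, n, d) pairwise differences
    norms = np.linalg.norm(diffs, axis=2)      # (n, n) pairwise distances
    norms[norms == 0] = 1.0                    # zero difference contributes a zero sign
    signs = diffs / norms[:, :, None]          # spatial signs S(x_i - x_j)
    return signs.mean(axis=1)                  # (n, d) spatial ranks

def mrcm(X):
    """Sketch of a spatial-rank-based covariance estimate.

    Assumption: MAD (scaled by 1.4826 for consistency at the normal
    model) is used as the robust univariate scale estimator.
    """
    R = spatial_ranks(X)
    rcm = R.T @ R / len(X)                     # rank covariance matrix
    _, U = np.linalg.eigh(rcm)                 # its eigenvectors give the directions
    Z = X @ U                                  # project data onto those directions
    mad = np.median(np.abs(Z - np.median(Z, axis=0)), axis=0)
    scales = 1.4826 * mad                      # robust scale per direction
    return (U * scales**2) @ U.T               # U diag(scales^2) U^T
```

Only the eigenvectors come from the spatial ranks; the eigenvalues are re-estimated from the projected data with the robust scale, which is why the choice of scale estimator matters for the breakdown point. Calling `mrcm(X)` on an `(n, d)` array returns a symmetric `(d, d)` matrix.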

Included in

Mathematics Commons
