
Joint Feature Screening and Subsampling in Analysis of Massive Data

dc.contributor.authorJing, Kaili
dc.contributor.supervisorXu, Chen
dc.date.accessioned2023-05-09T13:53:27Z
dc.date.available2023-05-09T13:53:27Z
dc.date.issued2023-05-09en_US
dc.description.abstractIn modern scientific research, collecting large amounts of data has become increasingly prevalent. For example, in genomics, more than 500,000 microarrays are publicly available, each containing tens of thousands of expression values of molecules; in biomedical engineering, there are tens of thousands of terabytes of fMRI images, each containing more than 50,000 voxel values (Fan et al., 2014). Compared to traditional datasets, massive datasets hold great promise for extracting crucial information and discovering subtle patterns. However, their characteristics, such as huge volume (large sample size and high dimensionality), heterogeneity (multiple latent sub-populations), and corruption (complicated noise), pose substantial challenges to traditional data analysis. These features call for analytical tools that handle not only huge volumes but also complex structures and uncertainty in the data. Big Data is thus something of a double-edged sword: it has been fundamentally transforming the way we live and think, especially in scientific research, and its emergence has spawned new research paradigms to address some of the resulting analytical difficulties, which are the subject of this work. One viable route to easing these difficulties is to remove most redundant information from a massive dataset before analyzing it. In this spirit, this dissertation develops effective statistical tools to screen out variables (features) and observations that are less relevant to the analysis. Specifically, the dissertation consists of the following three self-contained research projects.

1. Distributed hard screening for massive data. Massive datasets with a huge number of features are commonly encountered in many scientific areas.
Nevertheless, existing classic screening methods are inefficient or even infeasible due to the high computational burden of a large-N-large-p dataset, where N is the sample size and p is the number of features. When both N and p are large, we face the so-called large-N-large-p regime, which creates opportunities for extracting crucial information but also poses substantial challenges. Developing effective tools to manage and utilize such large-scale data is an attractive and important topic in statistics and related fields. To address this problem, I develop a distributed screening method for the large-N-large-p regime. The new method is built upon an ADMM updating procedure for fitting l0-constrained consensus regression, where the data are processed in m manageable segments by multiple local computers. Compared to existing methods, the new method not only meets the computational-efficiency needs of processing massive data, but also screens out most noninfluential features while retaining influential ones with high probability. Under mild conditions, I show that the new method is convergent and leads to effective screening when the starting value is appropriately specified. The promising performance of the method is supported by extensive numerical and real data studies.

2. Joint feature screening for mixture regression. Finite mixtures of regression models are ubiquitous in the analysis of complex data. They aim to detect heterogeneity in the effects of a set of covariates on a response over a finite number of latent classes. When the number of covariates is large, directly fitting mixture regressions can be numerically costly and often yields poor interpretability. One practical strategy is to screen out most irrelevant covariates before an in-depth analysis. This strategy is referred to as feature screening, which has attracted a great deal of attention in the past decade.
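The distributed screening idea of Project 1 can be illustrated, in highly simplified form, with a consensus-ADMM loop whose sparsity step is a top-k hard threshold. This is only a sketch under invented settings, not the dissertation's actual algorithm: the function names, the number of segments m, the penalty rho, and the iteration count are all illustrative choices.

```python
import numpy as np

def hard_threshold(beta, k):
    """l0 projection: keep only the k largest-magnitude coefficients."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]
    out[keep] = beta[keep]
    return out

def distributed_screen(X, y, k, m=4, rho=1.0, iters=50):
    """Toy ADMM-style consensus loop over m data segments; returns the
    indices of the k features retained by the sparse consensus estimate."""
    N, p = X.shape
    segments = np.array_split(np.arange(N), m)
    beta = np.zeros((m, p))   # local estimates, one per segment
    u = np.zeros((m, p))      # scaled dual variables
    z = np.zeros(p)           # sparse consensus estimate
    for _ in range(iters):
        for j, idx in enumerate(segments):
            Xj, yj = X[idx], y[idx]
            # local least-squares update with the ADMM quadratic penalty
            A = Xj.T @ Xj + rho * np.eye(p)
            b = Xj.T @ yj + rho * (z - u[j])
            beta[j] = np.linalg.solve(A, b)
        # consensus step: average the local views, then hard-threshold
        z = hard_threshold((beta + u).mean(axis=0), k)
        u += beta - z
    return np.flatnonzero(z)
```

Because each segment only solves a small local system, the per-iteration cost scales with N/m per machine, which is the computational appeal of the consensus formulation.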
Despite the surge of research on feature screening, existing methods are mainly developed based on a dependence measure between the response and the features over the entire population. In finite mixtures of regressions, however, the sets of relevant features are class-specific. This makes feature screening in mixture regression a challenging task that has not been carefully studied. To address this problem, I propose a novel method for covariate screening in ultrahigh-dimensional Gaussian mixture regressions. The new method is built upon a sparsity-restricted expectation-approximation-maximization algorithm, which allows precise control of the number of covariates retained in each latent class. In the screening process, the joint effects among covariates are naturally accounted for, and class-specific screening results are produced without ad hoc steps. These merits give the new method an edge over existing screening methods. The promising performance of the method is supported by both theory and numerical examples, including real data analysis.

3. Low-gradient based subsampling for corrupted massive data. Subsampling is one of the basic methodologies for coping with massive data, among which subsampling based on the gradient of the loss is widely accepted as the most effective strategy. Past studies typically suggest large-gradient subsampling, which assigns each observation a subsampling probability directly proportional to its gradient of the loss. Such a strategy is effective for uncontaminated massive data but not for corrupted massive data, which are unavoidably encountered in real-world applications. Based on the observation that large-gradient rules risk assigning higher sampling probabilities to heavily corrupted observations, I propose a low-gradient subsampling procedure for corrupted massive data.
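The sparsity-restricted EM idea of Project 2 can be sketched for a two-class Gaussian mixture of linear regressions: the E-step computes posterior class responsibilities, and each M-step performs a responsibility-weighted least-squares fit followed by a hard threshold that keeps exactly k coefficients per latent class. This is a stylized toy version, not the dissertation's expectation-approximation-maximization algorithm; the function name, the initialization, and the small ridge/floor constants are all assumptions made for the example.

```python
import numpy as np

def hard_threshold(beta, k):
    """Keep only the k largest-magnitude coefficients."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]
    out[keep] = beta[keep]
    return out

def sparse_em_screen(X, y, k, n_classes=2, iters=30, seed=0):
    """Toy sparsity-restricted EM for a Gaussian mixture of regressions.
    Returns one retained-feature index set per latent class."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    betas = rng.normal(scale=1.0, size=(n_classes, p))  # random start
    pis = np.full(n_classes, 1.0 / n_classes)           # mixing weights
    sigma2 = np.full(n_classes, np.var(y) + 1e-8)       # class variances
    for _ in range(iters):
        # E-step: posterior responsibilities, computed stably in log space
        resid = y[:, None] - X @ betas.T                # shape (N, n_classes)
        logw = np.log(pis) - 0.5 * resid**2 / sigma2 - 0.5 * np.log(sigma2)
        logw -= logw.max(axis=1, keepdims=True)
        r = np.exp(logw)
        r /= r.sum(axis=1, keepdims=True)
        # M-step, restricted to k nonzero coefficients per class
        for c in range(n_classes):
            w = r[:, c]
            A = (X * w[:, None]).T @ X + 1e-6 * np.eye(p)  # tiny ridge
            betas[c] = hard_threshold(np.linalg.solve(A, X.T @ (w * y)), k)
            rc = y - X @ betas[c]
            sigma2[c] = (w * rc**2).sum() / (w.sum() + 1e-8) + 1e-8
        pis = r.mean(axis=0)
    return [np.flatnonzero(b) for b in betas]
```

The hard threshold inside the M-step is what gives "precise control" over how many covariates survive in each class: the retained set is read off directly from each class's sparse coefficient vector, with no separate per-class screening pass.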
The new procedure iteratively draws "low-gradient observations" and updates the sampling probabilities according to the resulting prediction performance. In this way, the procedure excludes the most corrupted observations and leads to a robust subsampling strategy. I show that the procedure is statistically consistent and achieves nearly optimal estimation precision. The promising performance of the procedure is supported by a variety of numerical simulations and real data applications.

The effective statistical methods developed in this dissertation represent significant steps toward tackling the challenges of Big Data. Built on the core idea of removing redundant information that contributes little to the analysis, this dissertation provides promising results and reliable guidance for advancing the frontiers of massive-data analysis.en_US
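The low-gradient idea of Project 3 can be illustrated on corrupted linear data. The sketch below replaces the dissertation's probabilistic updating with a cruder deterministic stand-in: at each round it refits on the n_sub observations with the smallest per-observation gradient norms, so gross outliers (which have huge residuals, hence huge gradients) are progressively excluded. The function name, the round count, and this deterministic trimming rule are all illustrative assumptions, not the actual procedure.

```python
import numpy as np

def low_gradient_subsample(X, y, n_sub, rounds=5):
    """Toy iterative low-gradient subsampling for corrupted linear data.
    Returns the final coefficient estimate and the retained indices."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initial non-robust fit
    keep = np.arange(len(y))
    for _ in range(rounds):
        # gradient norm of the squared loss at observation i:
        # ||x_i (x_i' beta - y_i)|| = |residual_i| * ||x_i||
        g = np.abs(X @ beta - y) * np.linalg.norm(X, axis=1)
        # retain the n_sub observations with the SMALLEST gradients
        keep = np.argsort(g)[:n_sub]
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta, keep
```

On data where a fraction of responses is grossly shifted, this low-gradient refit typically lands much closer to the true coefficients than a plain least-squares fit on the full corrupted sample, which is the intuition behind preferring low-gradient over large-gradient draws in the corrupted setting.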
dc.identifier.urihttp://hdl.handle.net/10393/44903
dc.identifier.urihttp://dx.doi.org/10.20381/ruor-29109
dc.language.isoenen_US
dc.publisherUniversité d'Ottawa / University of Ottawaen_US
dc.subjectfeature screeningen_US
dc.subjectsubsamplingen_US
dc.subjectmassive dataen_US
dc.subjectheterogeneityen_US
dc.subjectcorruptionen_US
dc.titleJoint Feature Screening and Subsampling in Analysis of Massive Dataen_US
dc.typeThesisen_US
thesis.degree.disciplineSciences / Scienceen_US
thesis.degree.levelDoctoralen_US
thesis.degree.namePhDen_US
uottawa.departmentMathématiques et statistique / Mathematics and Statisticsen_US

Files

Original bundle

Name:
Jing_Kaili_2023_thesis.pdf
Size:
4.31 MB
Format:
Adobe Portable Document Format
Description:
a single thesis file in PDF/A format

License bundle

Name:
license.txt
Size:
6.65 KB
Format:
Description:
Item-specific license agreed upon to submission