Hard-Threshold-Based Feature Screening for Ultrahigh-Dimensional GLMs: Methods and Software

dc.contributor.author: Zang, Qianxiang
dc.contributor.supervisor: Burkett, Kelly
dc.contributor.supervisor: Xu, Chen
dc.date.accessioned: 2025-03-24T19:22:49Z
dc.date.available: 2025-03-24T19:22:49Z
dc.date.issued: 2025-03-24
dc.description.abstract: In modern scientific research, the dimensionality and complexity of datasets are growing exponentially. Geneticists analyze data from millions of Single Nucleotide Polymorphisms (SNPs) to identify disease associations, while cybersecurity experts monitor vast amounts of data packets in real time to detect spam and viruses. Financial analysts continually update bankruptcy predictions as new data become available, and social media platforms identify real-time trends and hot topics from immense streams of content. Undoubtedly, these new types of datasets hold great potential for uncovering subtle patterns. However, their ultrahigh dimensionality, complex structures, and the need for real-time processing present considerable challenges for traditional statistical methods. This calls for the development of novel tools that make modern data analytics viable. To ease the analytical difficulties, it is often beneficial to pre-process a high-dimensional dataset by efficiently detecting and eliminating a large number of features that are irrelevant to the analysis. This strategy, referred to as feature screening, aims to bring down the computational cost without loss of key information. Among the existing screening techniques, the hard-threshold-based method has attracted a great deal of attention for its high accuracy and stability. This dissertation develops user-friendly statistical software for this technique and further designs new screening approaches to extend its applicability. Specifically, the dissertation consists of the following three self-contained research projects. The first project develops a publicly available R package, SMLE, for joint feature screening in ultrahigh-dimensional generalized linear models. The package provides a user-friendly environment to carry out the Sparsity-restricted Maximum Likelihood Estimation (SMLE) screening method, which is computationally convenient and practically effective.
In the second project, I design a stability-enhanced screening procedure via an iterative splicing for sparse estimation (ISSE) technique. Compared with SMLE, ISSE is insensitive to the initial model and significantly improves the screening accuracy for data with complex correlation structures. In the third project, I address feature screening in a streaming-data setup, where batches of new features arrive over time. For this challenging task, I propose a novel Batch Adapted Neighbour Searching (BANS) algorithm, which screens features in real time based on a one-step approximated hard-thresholding procedure. This dissertation presents innovative attempts to address the challenges in analyzing ultrahigh-dimensional data. The developed software SMLE provides effective and publicly available tools for feature screening to researchers from various domains. It has had over 35,000 downloads from users around the world and was highlighted in the Canadian Statistical Society's quarterly newsletter in 2021. The two new algorithms, ISSE and BANS, significantly improve on the original SMLE method and substantially extend its application scope to emerging fields.
dc.identifier.uri: http://hdl.handle.net/10393/50286
dc.identifier.uri: https://doi.org/10.20381/ruor-30988
dc.language.iso: en
dc.publisher: Université d'Ottawa / University of Ottawa
dc.rights: Attribution 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Feature Screening
dc.subject: Ultrahigh-dimensional Data
dc.subject: Generalized Linear Models
dc.subject: Statistical Software
dc.subject: Real-time Data Processing
dc.subject: Sparse Estimation
dc.subject: Iterative Splicing
dc.title: Hard-Threshold-Based Feature Screening for Ultrahigh-Dimensional GLMs: Methods and Software
dc.type: Thesis (en)
thesis.degree.discipline: Sciences / Science
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Mathématiques et statistique / Mathematics and Statistics

Files

Original bundle

Name: Zang_Qianxiang_2025_thesis.pdf
Size: 2.94 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.65 KB
Format: Item-specific license agreed upon to submission