Handling Missing Data in the Multivariate Growth Curve Model: Comparative Evaluations Using Extensive Simulations

Wang, Zirui
2025-08-14
http://hdl.handle.net/10393/50766
https://doi.org/10.20381/ruor-31321

Multivariate growth curve models (GCMs) are generalized multivariate analysis of variance (GMANOVA) models useful for analyzing longitudinal data, growth curves, and other datasets involving response curves. Unlike the MANOVA model and traditional linear models (e.g., generalized linear mixed models), the GCM allows a structured mean. That is, the mean trajectory for each group is expressed as a function of time and is assumed to be represented by a polynomial of appropriate degree. The GCM therefore involves two design matrices: the within-individual design matrix, which captures the mean structure as a function of time, and the between-individual design matrix, which represents group membership coded as dummy variables. Several extensions have been made to accommodate clustered longitudinal data, high-dimensional data, and data that do not follow the multivariate normal distribution. Models that allow structured covariances (e.g., compound symmetry, autoregressive) are also available. A critical challenge in data analysis is the presence of missing data, which is especially common in longitudinal studies. Approaches for handling missing longitudinal data exist and are used in practical applications. However, applications involving the GCM are limited to data with no missing values or to complete-case analysis, in which individuals with missing data are deleted. As a result, either the GCM is not used in practice (despite its demonstrated optimality, in particular when the sample size is small) or information is lost, leading to suboptimal inference. We performed a comprehensive literature review and identified three multivariate methods for handling missing data that were developed under the GCM framework and take advantage of the bilinear nature of the model.
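The two-design-matrix structure described above can be written compactly as follows. The symbols ($Y$, $X$, $B$, $Z$, $E$, $\Sigma$, $S$) and the orientation of the matrices are assumptions chosen for illustration and may differ from the notation used in the thesis. The estimator shown is the standard complete-data maximum likelihood estimator of the mean parameters, not one of the three missing-data methods.

```latex
% Growth curve model for n individuals, p time points, m groups,
% and a polynomial of degree q - 1 (notation assumed for illustration)
\[
  \underset{n \times p}{Y}
  \;=\;
  \underset{n \times m}{X}\;
  \underset{m \times q}{B}\;
  \underset{q \times p}{Z}
  \;+\;
  \underset{n \times p}{E},
  \qquad
  \text{rows of } E \overset{\text{iid}}{\sim} N_p(0, \Sigma).
\]
% X: between-individual design matrix (group dummies)
% Z: within-individual design matrix (rows 1, t, t^2, ..., t^{q-1})
% Complete-data maximum likelihood estimator of B:
\[
  \hat{B}
  \;=\;
  (X^{\top} X)^{-1} X^{\top} Y \, S^{-1} Z^{\top}
  \left( Z S^{-1} Z^{\top} \right)^{-1},
  \qquad
  S \;=\; Y^{\top} \left( I_n - X (X^{\top} X)^{-1} X^{\top} \right) Y .
\]
```

The mean $XBZ$ is linear in $B$ from both sides, which is the bilinear structure the abstract refers to.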
However, our literature review also revealed that implementations and practical applications of these methods are limited, even though the methods have been available for several decades. This is likely due to a combination of factors, including the complexity of the proposed approaches and the lack of readily available software to implement them. Moreover, the performance of the methods has not yet been investigated and compared. In this study, we considered the three multivariate methods: Kleinbaum's (1973) method based on the generalized growth curve model (GGCM), Liski's (1984) Bayesian method, and Liski's (1985) EM algorithm. We implemented these methods in R, allowing flexible adaptation to different settings. We also extended the methods so that they generalize to any number of time points, groups, and sample sizes. We conducted extensive simulations to evaluate and compare the performance of these methods in estimating the mean and variance-covariance parameters of the GCM. We also identified limitations and methodological gaps, some of which we addressed in this thesis. The methods are compared using element-wise and aggregated bias and mean squared error (MSE), and by benchmarking them against the gold standard (assuming no missing data) and complete-case analysis (list-wise deletion of individuals with missing data, which is the current practice in applications involving the GCM). Our simulation results show that complete-case analysis (ignoring missing data) is not desirable in most of the scenarios considered, in particular when the sample size is small and the percentage of missing data is high, as well as when the missingness mechanism is non-ignorable. Even when the percentage of missing data is small and the mechanism is missing completely at random (MCAR), which is considered ignorable, the methods developed under the GCM framework provided uniformly smaller MSE values than complete-case analysis.
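The evaluation design described above (element-wise bias and MSE, with a gold-standard and a complete-case benchmark) can be sketched roughly as follows. This is a minimal Python illustration rather than the thesis's R implementation, and it covers only the two benchmarks, not the Kleinbaum or Liski methods. The function names `gcm_mle` and `simulate`, the true parameter values, the covariance matrix, and the way MCAR missingness is applied (whole individuals flagged as incomplete) are all hypothetical choices made for this sketch.

```python
import numpy as np


def gcm_mle(Y, X, Z):
    """Complete-data MLE of B in Y = X B Z + E (standard GCM estimator)."""
    n = Y.shape[0]
    P = X @ np.linalg.solve(X.T @ X, X.T)        # projection onto col space of X
    S = Y.T @ (np.eye(n) - P) @ Y                # residual SSCP matrix, p x p
    S_inv_Zt = np.linalg.solve(S, Z.T)           # S^{-1} Z'
    left = np.linalg.solve(X.T @ X, X.T @ Y)     # (X'X)^{-1} X'Y
    return left @ S_inv_Zt @ np.linalg.inv(Z @ S_inv_Zt)


def simulate(n_per_group=20, miss_rate=0.2, n_rep=200, seed=0):
    """Compare gold-standard and complete-case estimates of B under MCAR."""
    rng = np.random.default_rng(seed)
    t = np.arange(1.0, 5.0)                      # 4 time points (assumed)
    Z = np.vstack([np.ones_like(t), t])          # 2 x 4 linear within design
    X = np.kron(np.eye(2), np.ones((n_per_group, 1)))  # n x 2 group dummies
    B = np.array([[1.0, 0.5], [2.0, 1.0]])       # hypothetical true parameters
    Sigma = 0.5 * np.eye(4) + 0.5                # compound symmetry (assumed)
    L = np.linalg.cholesky(Sigma)

    est_full, est_cc = [], []
    for _ in range(n_rep):
        E = rng.standard_normal((X.shape[0], 4)) @ L.T
        Y = X @ B @ Z + E
        # MCAR at the individual level: each individual is flagged as
        # incomplete with probability miss_rate, independent of the data.
        keep = rng.random(X.shape[0]) >= miss_rate
        est_full.append(gcm_mle(Y, X, Z))              # gold standard
        est_cc.append(gcm_mle(Y[keep], X[keep], Z))    # list-wise deletion

    def summarize(estimates):
        estimates = np.array(estimates)
        bias = estimates.mean(axis=0) - B              # element-wise bias
        mse = ((estimates - B) ** 2).mean(axis=0)      # element-wise MSE
        return bias, mse

    return summarize(est_full), summarize(est_cc)


if __name__ == "__main__":
    (bias_full, mse_full), (bias_cc, mse_cc) = simulate()
    print("aggregated MSE, gold standard:", mse_full.sum())
    print("aggregated MSE, complete case:", mse_cc.sum())
```

Summing the element-wise MSE matrix gives one possible aggregated measure; other aggregations (for example, a mean or a norm) would serve equally well in a sketch like this.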
Overall, the simulation results show that Liski's EM algorithm and Liski's Bayesian method perform comparably to the gold standard in estimating both the mean and variance-covariance parameters. However, these methods become unstable in estimating the variance-covariance matrix when the percentage of missing data is high, even when the missingness mechanism is MCAR or missing at random (MAR). Kleinbaum's GGCM method is unreliable for estimating the mean parameters of the GCM, but it remains stable for estimating the variance-covariance matrix. In fact, in most scenarios, Kleinbaum's method is the best-performing approach for estimating the variance-covariance matrix.

en
Attribution-NonCommercial 4.0 International
http://creativecommons.org/licenses/by-nc/4.0/
Multivariate Growth Curve Model; Missing Data; Longitudinal Data; Extensive Simulation
Thesis