A Methodology for Reliable Data Mining on Health Administrative Data: Case Studies on Pediatric Immune-Mediated Inflammatory Diseases in Ontario, Canada

Title: A Methodology for Reliable Data Mining on Health Administrative Data: Case Studies on Pediatric Immune-Mediated Inflammatory Diseases in Ontario, Canada
Authors: Tekieh, Mohammad Hossein
Date: 2022-04-26
Embargo: 2023-04-26
Abstract: Over the past century, the prevalence of immune-mediated inflammatory diseases (IMIDs) has increased worldwide. Exposures to environmental factors early in life have been associated with an increased risk of these diseases. However, hypothesis-driven analyses do not always identify all risk or protective factors, nor do they adequately explain how interactions between variables affect the risk of disease. Data mining can explore the data without specific a priori hypotheses, instead generating possible hypotheses for further analysis. Nevertheless, data mining techniques are still not popular among epidemiologists as a trustworthy analytical tool for studying population-based diseases, owing to the inexplicability of some methods (e.g., neural networks) and to unfamiliarity with, or uncommon use of, machine learning and data mining methods in real-world health care applications. At the same time, large amounts of routinely collected health data are amassed as a matter of operating electronic health systems. Routinely collected health data are not collected for research purposes; however, they are valuable sources of information for research as a secondary use of the data. In this study, following the design science research methodology, we developed a methodology for analyzing health administrative data with data mining techniques so as to provide reproducible, reliable, and trustworthy findings. The methodology addresses impartiality, validity, and sustainability concerns in five stages: Data Selection, Preprocessing, Modelling, Evaluation, and Feedback. As part of the main contributions, we developed two unique preprocessing guidelines as the key components of the designed methodology in order to standardize technical steps and address contextual sources of bias.
While the proposed methodology is general in its design, we evaluated it by implementing it in several case studies on real health administrative data housed at ICES, Ontario. The first case study aimed to analyze children with an IMID in Ontario, predict new cases, and, most importantly, generate new hypotheses. A second case study narrowed the focus from all IMIDs to asthma, which formed the majority of the IMID cases. Finally, a third case study focused on inflammatory bowel disease (IBD) and systemic autoimmune rheumatic diseases (SARDs) to better compare the findings. We applied both predictive and descriptive modelling techniques, such as decision trees, neural networks, logistic regression, and k-means clustering, to the prepared datasets, which contained more than 700K records and over 80 input variables. We built classification models with notable quality of performance (AUC of 68%), identified the significant factors associated with IMIDs, and extracted multifactorial rules associated with protection against, or high risk of, developing asthma, IBD, and SARDs. The factors that contributed most to the extracted multifactorial rules were “general childhood infection”, “use of antibiotics”, “streptococcus pyogenes”, “respiratory infection”, “gastroenteritis”, “mother's prevalence of any IMID”, and “baby's sex”. The findings were evaluated and verified by health experts. Most data mining studies applied to health data do not handle bias and confounding. In these case studies, however, following the designed methodology meant that systematic errors were identified and their risks assessed. Results with a high risk of bias were reported as results to disregard.
This process therefore allowed us to apply data mining techniques to discover new multifactorial rules and to identify the factors with the highest impact among the 128 factors observed in past epidemiological studies, while preserving the trust of domain experts in the results.
URL: http://hdl.handle.net/10393/43520
Collection: Thèses - Embargo // Theses - Embargo