Predicting High-cost Patients in General Population Using Data Mining Techniques

Title: Predicting High-cost Patients in General Population Using Data Mining Techniques
Authors: Izad Shenas, Seyed Abdolmotalleb
Date: 2012
Abstract: In this research, we apply data mining techniques to a nationally-representative expenditure data from the US to predict very high-cost patients in the top 5 cost percentiles, among the general population. Samples are derived from the Medical Expenditure Panel Survey’s Household Component data for 2006-2008 including 98,175 records. After pre-processing, partitioning and balancing the data, the final MEPS dataset with 31,704 records is modeled by Decision Trees (including C5.0 and CHAID), Neural Networks. Multiple predictive models are built and their performances are analyzed using various measures including correctness accuracy, G-mean, and Area under ROC Curve. We conclude that the CHAID tree returns the best G-mean and AUC measures for top performing predictive models ranging from 76% to 85%, and 0.812 to 0.942 units, respectively. Among a primary set of 66 attributes, the best predictors to estimate the top 5% high-cost population include individual’s overall health perception, history of blood cholesterol check, history of physical/sensory/mental limitations, age, and history of colonic prevention measures. It is worthy to note that we do not consider number of visits to care providers as a predictor since it has a high correlation with the expenditure, and does not offer a new insight to the data (i.e. it is a trivial predictor). We predict high-cost patients without knowing how many times the patient was visited by doctors or hospitalized. Consequently, the results from this study can be used by policy makers, health planners, and insurers to plan and improve delivery of health services.
CollectionThèses, 2011 - // Theses, 2011 -