Author: Moeini, Behrad
Date: 2025-04-11
Handle: http://hdl.handle.net/10393/50333
DOI: https://doi.org/10.20381/ruor-31015
Title: An Empirical Study on the Resilience of Cloud-Native Systems Using Dynamic Scaling Strategies
Type: Thesis
Language: en
Subjects: Cloud-native systems; Service Level Objectives (SLOs)

Abstract: Cloud-native systems are essential in customer service, handling complex interactions and fluctuating loads. This thesis presents an empirical study of how companies can assess a predictive analysis method for maintaining their cloud-based services, focusing on challenges such as increased response times, higher failure rates during peak usage, and inefficient resource allocation. We answer key resource-management questions and address the need to meet predefined service level objectives (SLOs) for response time and failure rate. Our predictive framework integrates proactive predictive models, real-time monitoring, and Kubernetes' Horizontal Pod Autoscaler (HPA) to allocate resources dynamically and effectively. The empirical study aims to: (1) determine achievable SLOs with fixed resources; (2) identify the minimum resources needed to meet desired SLOs; and (3) estimate the maximum user capacity supported by given resources. Using decision tree regression models, we analyzed configurations focusing on CPU and memory utilization. Our findings show that the number of replicas significantly affects response time and failure rates, and that CPU utilization thresholds were slightly more effective than memory thresholds. The optimized models achieved high predictive accuracy, with R² values up to 0.85. Our research shows how engineers can perform predictive resource-management tasks for cloud-native systems, such as AI-driven chatbots, to ensure resilience and responsiveness.
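The abstract describes dynamic scaling via Kubernetes' Horizontal Pod Autoscaler driven by a CPU utilization threshold. A minimal HPA manifest of the kind such a setup would use might look like the following sketch; the Deployment name, replica bounds, and 70% threshold are illustrative assumptions, not values taken from the thesis.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chatbot-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chatbot-service        # hypothetical target Deployment
  minReplicas: 2                 # illustrative lower bound on replicas
  maxReplicas: 10                # illustrative upper bound on replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # illustrative CPU utilization threshold (%)
```

With this config, Kubernetes scales the replica count between the stated bounds to hold average CPU utilization near the target, which is the scaling knob the study's CPU-threshold configurations vary.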
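The abstract also reports using decision tree regression to relate configurations (replica counts, CPU/memory utilization thresholds) to response time, with R² as the accuracy metric. A minimal sketch of that modeling step, using scikit-learn and entirely synthetic illustrative data (not the thesis's measurements), could look like this:

```python
# Hedged sketch: fit a decision tree regressor mapping a service
# configuration (replicas, CPU threshold %, memory threshold %) to a
# response time in ms. All feature values and response times below are
# synthetic and for illustration only.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Each row: [replica_count, cpu_threshold_pct, mem_threshold_pct]
X = [
    [1, 50, 50], [1, 70, 70], [2, 50, 50], [2, 70, 70],
    [4, 50, 50], [4, 70, 70], [8, 50, 50], [8, 70, 70],
]
# Synthetic response times (ms): more replicas -> lower latency.
y = [900, 850, 500, 480, 300, 290, 200, 195]

# Cap tree depth so the model stays simple on a small configuration grid.
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)

# R^2 on the training grid; the thesis reports values up to 0.85 on its data.
r2 = r2_score(y, model.predict(X))
print(f"Training R^2: {r2:.2f}")
```

On real data one would hold out a test split before computing R²; the training-set score here only shows the mechanics of the fit.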