
Detection, Categorization and Repair of Flaky Tests Using Large Language Models

Publisher

Université d'Ottawa / University of Ottawa

Abstract

Software testing is critical for ensuring software dependability. However, some test cases, known as flaky tests, exhibit non-deterministic behavior, passing or failing inconsistently even on the same version of the source code. Flaky tests create significant overhead in software development, forcing developers to rerun tests or debug code unnecessarily. Traditional approaches to detecting flaky tests rerun them many times, which is computationally expensive and impractical for large test suites. Machine learning (ML) models have been proposed as a scalable alternative that predicts flaky tests without reruns. However, existing ML-based techniques often rely on production code or project-specific features, limiting their generalizability across diverse projects, and their reliance on predefined feature sets results in suboptimal accuracy on realistic datasets. To address these challenges, we propose two novel, black-box solutions based on large language models.

(a) Flakify is a flaky test predictor that relies solely on the source code of test cases, eliminating the need for access to production code or predefined feature sets. Flakify uses CodeBERT, a pre-trained language model, and demonstrates superior performance on two benchmark datasets: it achieves F1-scores of 79% and 73% under cross-validation and per-project validation, respectively, on the FlakeFlagger dataset, and 98% and 89% on the IDoFT dataset. Flakify outperforms the state-of-the-art solution (FlakeFlagger) by 10 and 18 percentage points in precision and recall, respectively.

(b) FlakyFix is a framework that predicts the fix required for a flaky test by classifying it into one of 13 distinct fix categories based solely on the test code. By leveraging code models and few-shot learning, FlakyFix accurately predicts most fix categories. To further improve flaky test repair, we augment GPT-3.5 Turbo prompts with the predicted fix-category labels. Our experiments show that 51% to 83% of GPT-suggested repairs pass, with only 16% of the test code needing further modification in the remaining cases.

Together, these two approaches significantly reduce the overhead of rerunning flaky tests and provide an efficient method for predicting and repairing them, making them more suitable for real-world industrial applications.
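To make the Flakify design concrete, the sketch below shows how a CodeBERT-based predictor of this kind could be wired up with the Hugging Face transformers library. It is a minimal sketch under stated assumptions, not the thesis's implementation: the "microsoft/codebert-base" checkpoint and tokenization settings are standard for CodeBERT, but the classification head here is untrained, and the actual fine-tuning data, hyperparameters, and preprocessing are not reproduced.

```python
# Minimal sketch of a Flakify-style black-box predictor, assuming the
# Hugging Face "transformers" library and the public
# "microsoft/codebert-base" checkpoint. The classification head below is
# randomly initialized; in practice it would first be fine-tuned on a
# labeled flaky-test dataset (e.g. FlakeFlagger or IDoFT).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,  # 0 = not flaky, 1 = flaky
)
model.eval()

def predict_flaky(test_source: str) -> bool:
    """Classify a test case from its source code alone (black-box)."""
    inputs = tokenizer(
        test_source,
        truncation=True,   # CodeBERT accepts at most 512 tokens
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return bool(logits.argmax(dim=-1).item())  # True => predicted flaky

# A timing-dependent test: a classic source of flakiness.
example_test = """
@Test
public void testAsyncFetch() throws Exception {
    Future<String> result = service.fetchAsync();
    Thread.sleep(100);
    assertEquals("ok", result.get());
}
"""
print("predicted flaky:", predict_flaky(example_test))
```

Relying only on the test body is what keeps the approach black-box: no production code, build metadata, or predefined features are needed at prediction time.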
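The FlakyFix repair step can likewise be sketched as prompt augmentation: the predicted fix-category label is injected into the GPT-3.5 Turbo prompt before asking for a repaired test. The fix-category name, prompt wording, and the suggest_repair helper below are illustrative assumptions; the thesis's 13 category labels and exact prompts are not reproduced here.

```python
# Minimal sketch of a FlakyFix-style repair step, assuming the official
# "openai" Python SDK (v1) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def suggest_repair(test_source: str, fix_category: str) -> str:
    """Ask GPT-3.5 Turbo for a repaired test, guided by the predicted
    fix-category label produced upstream by the classifier."""
    prompt = (
        "The following test is flaky. Its predicted fix category is "
        f"'{fix_category}'. Rewrite the test so it is no longer flaky, "
        "changing only the test code.\n\n" + test_source
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes repairs easier to re-check
    )
    return response.choices[0].message.content

flaky_test = """
@Test
public void testAsyncFetch() throws Exception {
    Future<String> result = service.fetchAsync();
    Thread.sleep(100);
    assertEquals("ok", result.get());
}
"""
# Hypothetical category label; in FlakyFix it would come from the
# fix-category classifier rather than being hand-written.
print(suggest_repair(flaky_test, "replace Thread.sleep with an explicit wait"))
```

The design choice to pass the category label, rather than the raw test alone, narrows the model's search space to the kind of change the classifier expects, which is what the reported 51% to 83% passing-repair rates measure.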


Keywords

Flaky Tests, Test Repair, Large Language Models, Code Models, Few-Shot Learning, Software Testing
