
Inconsistency detection in cancer data classification using explainable-AI

Abstract

Background
Accurate classification of cancer-related text data is essential for early diagnosis and effective treatment. However, conventional classification methods often produce confusing error analyses due to data inconsistencies, semantic misalignment, and unreliable labeling. Manual error analysis is labor-intensive and prone to oversight, which limits the clinical utility of these approaches.

Aim
This study aims to develop a robust, explainable framework that automates and justifies error analysis by detecting inconsistencies, including potential mislabeling, in classification outcomes through a dual-perspective algorithmic approach.

Methods
We propose a novel dual-perspective framework that integrates unsupervised semantic clustering with supervised classification. Specifically, our approach combines BERT-based BERTopic clustering with SVM classification on Node2Vec embeddings to decouple semantic and structural perspectives. An Explainable Inconsistency Detection (EID) module automatically surfaces and removes inconsistencies between the clustering and classification outputs. Additionally, a collaborative filtering recommender system aligns clusters with ground-truth labels to adaptively refine results, with performance validated through rigorous statistical testing.

Results
Experimental evaluations on cancer datasets demonstrated substantial improvements in both classification performance and the clarity of error analysis. The optimized framework improved accuracy from 46% to 91% and the F1-score from 50% to 89%. Statistical analysis confirmed that these gains were significant (p < 0.05) and directly attributable to the targeted removal of inconsistent instances rather than random data exclusion.
Conclusions
The integration of BERTopic, SVM, and the Explainable Inconsistency Detection (EID) framework enhances both performance and interpretability by addressing semantic contradictions and structural anomalies in biomedical data. This semi-automated, explainable pipeline offers actionable insights into underlying data errors, presenting strong potential for integration into clinical decision support systems. Future work will focus on refining the EID module for broader generalization and exploring real-time applications in healthcare.
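The core idea of the dual-perspective framework, flagging as inconsistent those instances where the unsupervised (semantic) and supervised (structural) views disagree, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: scikit-learn's KMeans and synthetic features stand in for BERTopic clustering and Node2Vec embeddings, and the cluster-to-label alignment is done by simple majority vote rather than the paper's collaborative filtering recommender.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def flag_inconsistencies(X, y, n_clusters):
    """Flag samples where the supervised and unsupervised views disagree.

    Stand-in sketch: SVM plays the structural role (Node2Vec + SVM in the
    paper), KMeans plays the semantic role (BERTopic in the paper).
    """
    # Structural view: out-of-fold SVM predictions (avoids training leakage)
    clf_pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=5)
    # Semantic view: cluster the samples, then map each cluster to the
    # majority ground-truth label among its members
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    cluster_label = {c: np.bincount(y[clusters == c]).argmax()
                     for c in np.unique(clusters)}
    cluster_pred = np.array([cluster_label[c] for c in clusters])
    # Inconsistent instances: the two perspectives contradict each other
    return np.where(clf_pred != cluster_pred)[0]

# Toy data in place of cancer-text embeddings
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)
suspect = flag_inconsistencies(X, y, n_clusters=3)
print(f"{len(suspect)} potentially inconsistent samples flagged")
```

In the paper's pipeline, the flagged instances would then be surfaced with an explanation and removed before retraining, which is where the reported accuracy gain (46% to 91%) is attributed.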


Citation

BMC Artificial Intelligence. 2025 Jul 21;1(1):5
