Reproducible Breast Cancer Classification: A Controlled Comparative Evaluation of Machine Learning Models

Authors: Nomsa Chisomo Christina Kamgwira, Sukhpreet Singh

1. Student, Department of Computer Applications, Guru Kashi University, Bathinda, Punjab, India

2. Assistant Professor, Department of Computer Applications, Guru Kashi University, Bathinda, Punjab, India

Abstract

<p>Early diagnosis is one of the most important predictors of a favourable outcome in breast cancer, and the machine learning community has worked steadily to improve the accuracy of diagnostic classifiers. Reported results nevertheless vary widely between papers and rarely coincide, possibly because studies use different preprocessing techniques, validation protocols, and evaluation metrics. In this study, six supervised classification algorithms, Logistic Regression, Support Vector Machine (SVM), k-Nearest Neighbours (KNN), Decision Tree, Random Forest, and XGBoost, are compared using a common pipeline on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [6]. The preprocessing, feature scaling, train-test split, hyperparameter search spaces, and cross-validation procedure are identical across all models. SVM and KNN achieved the highest accuracy (98.25%), and KNN reached 100% recall (no false negatives). Logistic Regression achieved the highest ROC AUC (99.57%). A paired t-test found no statistically significant difference between SVM and KNN (t = 1.322, p = 0.257). The main result is that simpler, well-tuned classifiers remain highly competitive on structured medical data when all models are evaluated under the same experimental setup.</p>
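The controlled comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn's bundled copy of the WDBC data (`load_breast_cancer`), a random seed of 42, five stratified folds, and default-style hyperparameters in place of the tuned search spaces, none of which are stated in the abstract. Only SVM and KNN are shown, since those are the two models compared with the paired t-test.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumption: the abstract does not report a seed


def compare_models(n_splits=5, seed=SEED):
    """Score SVM and KNN on identical folds and run a paired t-test."""
    # scikit-learn ships a copy of the WDBC dataset (569 samples, 30 features).
    X, y = load_breast_cancer(return_X_y=True)

    # One splitter with a fixed seed, so both models see exactly the same folds.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Scaling lives inside each pipeline, so it is fitted on training folds only.
    # Hyperparameters here are illustrative, not the tuned values from the study.
    models = {
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    }

    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        for name, model in models.items()
    }

    # Paired test: fold i of SVM is matched with fold i of KNN.
    t_stat, p_value = ttest_rel(scores["SVM"], scores["KNN"])
    return scores, t_stat, p_value


if __name__ == "__main__":
    fold_scores, t_stat, p_value = compare_models()
    for name, values in fold_scores.items():
        print(f"{name}: mean accuracy = {values.mean():.4f}")
    print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")
```

Because the splitter, scaler, and scoring metric are shared, any difference in fold scores comes from the classifiers rather than from the protocol. The numbers this sketch prints will generally differ from those reported above, since the study's split, seed, and tuned hyperparameters are not reproduced here.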

Keywords

Breast Cancer Classification, Machine Learning, Reproducibility, WDBC, SVM, KNN, XGBoost, Random Forest, Hyperparameter Tuning.

MRR Journal
