Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI

Barzegar, Yas; Barzegar, Atrin; Bellini, Francesco; D'Ascenzo, Fabrizio; Gorelova, Irina; Pisani, Patrizio

doi:10.3390/fi17110513

The use of Artificial Intelligence (AI) in healthcare has significantly advanced early disease detection, enabling timely diagnosis and improved patient outcomes. This work proposes an end-to-end machine learning (ML) model for predicting diabetes based on data quality by following key steps, including advanced preprocessing by KNN imputation, intelligent feature selection, class imbalance with a hybrid approach of SMOTEENN, and multi-model classification. We rigorously compared nine ML classifiers, namely ensemble approaches (Random Forest, CatBoost, XGBoost), Support Vector Machines (SVM), and Logistic Regression (LR) for the prediction of diabetes disease. We evaluated performance on specificity, accuracy, recall, precision, and F1-score to assess generalizability and robustness. We employed SHapley Additive exPlanations (SHAP) for explainability, ranking, and identifying the most influential clinical risk factors. SHAP analysis identified glucose levels as the dominant predictor, followed by BMI and age, providing clinically interpretable risk factors that align with established medical knowledge. Results indicate that ensemble models have the highest performance among the others, and CatBoost performed the best, which achieved an ROC-AUC of 0.972, an accuracy of 0.968, and an F1-score of 0.971. The model was successfully validated on two larger datasets (CDC BRFSS and a 130-hospital dataset), confirming its generalizability. This data-driven design provides a reproducible platform for applying useful and interpretable ML models in clinical practice as a primary application for future Internet-of-Things-based smart healthcare systems.