Model Evaluation Beyond AUC: A Comparative Study of Somers’ D, Log Loss, Population Stability Index (PSI), and Kolmogorov–Smirnov (KS) Statistic in Credit Risk and Healthcare Prediction Models
DOI: https://doi.org/10.63282/3050-9246/ICRTCSIT-113

Keywords: Credit risk, Model evaluation, AUC, KS-statistic, Somers’ D, Population Stability Index, Log Loss, Healthcare prediction

Abstract
The Area Under the Receiver Operating Characteristic Curve (AUC) is the dominant evaluation metric for machine learning classifiers. However, AUC alone cannot capture important properties such as calibration, population stability, and practical separability at operating thresholds. This paper presents an empirical comparison of AUC with Somers’ D, the Kolmogorov–Smirnov (KS) statistic, Log Loss, and the Population Stability Index (PSI) across three benchmark datasets: (1) the Breast Cancer dataset from scikit-learn, (2) the Heart Failure dataset from Kaggle, and (3) the Lending dataset from Kaggle. Our results show that on the Breast Cancer dataset, Logistic Regression achieves near-perfect discrimination (AUC = 0.999, KS = 0.977) with low log loss and a stable PSI, outperforming more complex models. On the Heart Failure dataset, Gradient Boosting offers the best balance between discrimination (AUC = 0.943, KS = 0.784) and stability (PSI = 0.076), while Random Forest, though highly accurate, shows instability (PSI = 0.183). On the Lending dataset, all models show modest discrimination (AUC ≈ 0.70), but Logistic Regression and Gradient Boosting offer the best trade-off among simplicity, interpretability, and stability. These findings underscore the importance of a multi-metric evaluation framework that goes beyond AUC, integrating discrimination, calibration, and stability metrics for trustworthy machine learning in regulated domains such as finance and healthcare.
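
To make the multi-metric framework concrete, the sketch below computes all five metrics with scikit-learn, SciPy, and NumPy on the Breast Cancer dataset cited above. This is a minimal illustration, not the paper's exact pipeline: the 70/30 split, the logistic regression settings, and the ten-bucket PSI binning of training versus test scores are assumptions made here for demonstration.

    # Minimal sketch of the multi-metric evaluation (not the authors' exact pipeline).
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, log_loss
    from scipy.stats import ks_2samp
    import numpy as np

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    p_tr = model.predict_proba(X_tr)[:, 1]   # scores on the "expected" sample
    p_te = model.predict_proba(X_te)[:, 1]   # scores on the "actual" sample

    auc = roc_auc_score(y_te, p_te)
    somers_d = 2 * auc - 1                   # Somers' D for a binary target
    ks = ks_2samp(p_te[y_te == 1], p_te[y_te == 0]).statistic
    ll = log_loss(y_te, p_te)

    def psi(expected, actual, buckets=10):
        # Population Stability Index over quantile buckets of the expected scores.
        edges = np.quantile(expected, np.linspace(0, 1, buckets + 1))
        e_idx = np.clip(np.searchsorted(edges, expected) - 1, 0, buckets - 1)
        a_idx = np.clip(np.searchsorted(edges, actual) - 1, 0, buckets - 1)
        e = np.bincount(e_idx, minlength=buckets) / len(expected)
        a = np.bincount(a_idx, minlength=buckets) / len(actual)
        e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # guard log(0)
        return float(np.sum((a - e) * np.log(a / e)))

    print(f"AUC={auc:.3f}  Somers' D={somers_d:.3f}  KS={ks:.3f}  "
          f"LogLoss={ll:.3f}  PSI={psi(p_tr, p_te):.3f}")

Note that for a binary outcome Somers’ D reduces to 2·AUC − 1, so it is derived from the AUC rather than estimated separately, and the PSI here compares training-score against test-score distributions, mirroring the kind of stability check the abstract describes.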
