Explaining Random Forest and XGBoost with Shallow Decision Trees by Co-clustering Feature Importance

Pensa, Ruggero G. (First author; Member of the Collaboration Group)
2025-01-01

Abstract

Transparency is a non-functional requirement of machine learning that promotes interpretable models and easily explainable outcomes. Unfortunately, interpretable classification models, such as linear, rule-based, and decision tree models, are superseded by more accurate but complex learning paradigms, such as deep neural networks and ensemble methods. More specifically, for tabular data classification, models based on tree ensembles, such as random forest or XGBoost, remain competitive with deep learning models and are often preferred to them. However, due to the complexity of the learned model, they share the same interpretability issues and, consequently, offer low explainability of predictions. Existing solutions consist of computing feature importance scores or extracting an approximate surrogate model from the learned tree ensemble. However, these methods lead to surrogate models with either poor fidelity or questionable comprehensibility. In this paper, we propose to improve this trade-off by using Goodman-Kruskal's association measure to find groups of instances whose predictions are explained by shared groups of features. To build this structure, instances are first described by SHAP values, which capture local feature importance, and then co-clustered with the features on the basis of these SHAP values. Next, a surrogate model is built as a set of shallow decision trees, each learned on one group of instances and its subset of relevant features. Our experiments show that our method produces surrogate models that explain random forest and XGBoost classifiers with competitive fidelity and higher comprehensibility than recent state-of-the-art competitors.
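To make the pipeline described above concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes scikit-learn and the shap package are installed; the Goodman-Kruskal-driven co-clustering is approximated here by scikit-learn's SpectralCoclustering on the absolute SHAP values, and the number of co-clusters (4) and tree depth (3) are illustrative choices rather than the paper's settings.

    import numpy as np
    import shap
    from sklearn.cluster import SpectralCoclustering
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # 1. Train the black-box tree ensemble to be explained.
    ensemble = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # 2. Describe each instance by its local feature importances (SHAP values).
    shap_vals = shap.TreeExplainer(ensemble).shap_values(X)
    if isinstance(shap_vals, list):              # older shap: one array per class
        shap_matrix = np.abs(shap_vals[1])
    elif shap_vals.ndim == 3:                    # newer shap: (n, d, n_classes)
        shap_matrix = np.abs(shap_vals[..., 1])
    else:                                        # already (n, d)
        shap_matrix = np.abs(shap_vals)

    # 3. Co-cluster instances and features on the SHAP matrix (a stand-in for
    #    the Goodman-Kruskal-based co-clustering used in the paper).
    n_coclusters = 4
    cc = SpectralCoclustering(n_clusters=n_coclusters, random_state=0)
    cc.fit(shap_matrix + 1e-9)                   # epsilon avoids all-zero rows

    # 4. For each co-cluster, fit a shallow decision tree on the relevant
    #    features, using the ensemble's predictions as targets so that the
    #    surrogate mimics the black box rather than the ground truth.
    surrogates = {}
    for k in range(n_coclusters):
        rows = np.where(cc.row_labels_ == k)[0]
        cols = np.where(cc.column_labels_ == k)[0]
        if rows.size == 0 or cols.size == 0:
            continue
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[np.ix_(rows, cols)], ensemble.predict(X[rows]))
        surrogates[k] = (cols, tree)

    # 5. Fidelity per group: agreement between surrogate and ensemble.
    for k, (cols, tree) in surrogates.items():
        rows = np.where(cc.row_labels_ == k)[0]
        agree = np.mean(tree.predict(X[np.ix_(rows, cols)])
                        == ensemble.predict(X[rows]))
        print(f"group {k}: fidelity {agree:.2f}")

Each shallow tree then serves as a readable, group-specific explanation: its bounded depth keeps the decision rules short, and the co-clustering restricts it to the features that matter for that group of instances.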
Year: 2025
Volume: 114
Issue: 12
Pages: 1-27
https://link.springer.com/article/10.1007/s10994-025-06932-9
Keywords: Explainable AI, Tree ensemble model, Co-clustering, SHAP value
Authors: Pensa, Ruggero G.; Crombach, Anton; Peignier, Sergio; Rigotti, Christophe
Files in this product:
File: ml2025_online.pdf
Access: Open access
Description: paper online OA
File type: Publisher's PDF
Size: 2.96 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2106591
Citations
  • PMC: not available
  • Scopus: 1
  • Web of Science (ISI): 1