Explaining Random Forest and XGBoost with Shallow Decision Trees by Co-clustering Feature Importance

Pensa, Ruggero G. (First author; Member of the Collaboration Group)
2025-01-01

Abstract

Transparency is a non-functional requirement of machine learning that promotes interpretable models and easily explainable outcomes. Unfortunately, interpretable classification models, such as linear, rule-based, and decision tree models, are superseded by more accurate but complex learning paradigms, such as deep neural networks and ensemble methods. More specifically, for tabular data classification, models based on tree ensembles, such as random forest or XGBoost, remain competitive with deep learning models and are often preferred to them. However, due to the complexity of the learned model, they share the same interpretability issues and, consequently, offer low explainability of predictions. Existing solutions consist of computing feature importance scores or extracting an approximate surrogate model from the learned tree ensemble. However, these methods lead to surrogate models with either poor fidelity or questionable comprehensibility. In this paper, we propose to improve this trade-off by using Goodman-Kruskal's association measure to find groups of instances whose predictions are explained by shared groups of features. To build this structure, instances are first described by SHAP values, which capture local feature importance, and then co-clustered with the features on the basis of these SHAP values. Next, a surrogate model is built as a set of shallow decision trees, each learned on one group of instances and its subset of relevant features. Our experiments show that our method produces surrogate models that explain random forest and XGBoost classifiers with competitive fidelity and higher comprehensibility than recent state-of-the-art competitors.
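To make the pipeline described above concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes scikit-learn and the shap package are installed; the Goodman-Kruskal-driven co-clustering is approximated here by scikit-learn's SpectralCoclustering on the absolute SHAP values, and the number of co-clusters (4) and tree depth (3) are illustrative choices rather than the paper's settings.

    import numpy as np
    import shap
    from sklearn.cluster import SpectralCoclustering
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # 1. Train the black-box tree ensemble to be explained.
    ensemble = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # 2. Describe each instance by its local feature importances (SHAP values).
    shap_vals = shap.TreeExplainer(ensemble).shap_values(X)
    if isinstance(shap_vals, list):              # older shap: one array per class
        shap_matrix = np.abs(shap_vals[1])
    elif shap_vals.ndim == 3:                    # newer shap: (n, d, n_classes)
        shap_matrix = np.abs(shap_vals[..., 1])
    else:                                        # already (n, d)
        shap_matrix = np.abs(shap_vals)

    # 3. Co-cluster instances and features on the SHAP matrix (a stand-in for
    #    the Goodman-Kruskal-based co-clustering used in the paper).
    n_coclusters = 4
    cc = SpectralCoclustering(n_clusters=n_coclusters, random_state=0)
    cc.fit(shap_matrix + 1e-9)                   # epsilon avoids all-zero rows

    # 4. For each co-cluster, fit a shallow decision tree on the relevant
    #    features, using the ensemble's predictions as targets so that the
    #    surrogate mimics the black box rather than the ground truth.
    surrogates = {}
    for k in range(n_coclusters):
        rows = np.where(cc.row_labels_ == k)[0]
        cols = np.where(cc.column_labels_ == k)[0]
        if rows.size == 0 or cols.size == 0:
            continue
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[np.ix_(rows, cols)], ensemble.predict(X[rows]))
        surrogates[k] = (cols, tree)

    # 5. Fidelity per group: agreement between surrogate and ensemble.
    for k, (cols, tree) in surrogates.items():
        rows = np.where(cc.row_labels_ == k)[0]
        agree = np.mean(tree.predict(X[np.ix_(rows, cols)])
                        == ensemble.predict(X[rows]))
        print(f"group {k}: fidelity {agree:.2f}")

Each shallow tree then serves as a readable, group-specific explanation: its bounded depth keeps the decision rules short, and the co-clustering restricts it to the features that matter for that group of instances.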
Year: 2025
Volume: 114
Issue: 12
Pages: 1-27
https://link.springer.com/article/10.1007/s10994-025-06932-9
Keywords: Explainable AI, Tree ensemble model, Co-clustering, SHAP value
Authors: Pensa, Ruggero G.; Crombach, Anton; Peignier, Sergio; Rigotti, Christophe
Files in this product:
File: ml2025_online.pdf
Access: Open access
Description: paper online OA
File type: Publisher's PDF
Size: 2.96 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2106591
Citations
  • PMC: not available
  • Scopus: 1
  • Web of Science (ISI): 1