Correlation based Feature Selection impact on the classification of breast cancer patients response to neoadjuvant chemotherapy

Rosati, S.; Gianfreda, C. M.; Balestra, G.; Martincich, L.; Giannini, V.; Regge, D.

doi:10.1109/MeMeA.2018.8438698

The availability of a large number of variables is not always associated with better classification performance, as some of them can be redundant, irrelevant, or a source of noise. For this reason, a Feature Selection (FS) step is often applied to high-dimensional datasets. FS based on correlation relies on the idea that 'good feature subsets contain features highly correlated with the class yet uncorrelated with each other'. However, the main difficulty of this kind of approach is defining the threshold above which two variables are considered correlated. In this study, we evaluated the impact of different thresholds on the performance of two classifiers trained to predict the response to neoadjuvant chemotherapy (graded from 1 to 5) of 44 patients with breast cancer. First, 27 texture features were computed on the largest slices belonging to the segmented tumor on the pretreatment dynamic contrast-enhanced MRI. Then, we applied an FS algorithm that identifies the pairs of variables whose linear correlation coefficient has an absolute value above a given threshold and removes, for each pair, the variable less correlated with the response to neoadjuvant chemotherapy. We tested correlation thresholds ranging from 1 to 0.8 in steps of 0.01, and we used each resulting subset to construct a Decision Tree (DT) classifier and a Linear Regression Model (LRM). Our results showed that the removal of highly correlated variables (absolute value of the correlation coefficient >0.97) reduced the DT performance by about 10%. Although the LRM did not reach acceptable results in terms of chemotherapy response prediction (accuracy=40.9%), its intrinsic linearity made it more stable to the removal of linear redundancy.
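The pairwise filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `correlation_filter`, the constant `THRESHOLDS`, the order in which pairs are visited, the choice to skip pairs involving an already removed variable, and the tie-breaking rule are all assumptions, since the abstract does not specify them.

```python
import numpy as np

# Thresholds from 1 down to 0.8 in steps of 0.01, as stated in the abstract.
THRESHOLDS = [round(1.0 - 0.01 * k, 2) for k in range(21)]


def correlation_filter(X, y, threshold):
    """Return the column indices of X kept after redundancy removal.

    X: (n_samples, n_features) feature matrix; y: (n_samples,) response.
    For every pair of features whose absolute Pearson correlation exceeds
    `threshold`, the feature less correlated with y is dropped.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]

    # Absolute feature-feature correlations (columns are the variables).
    feat_corr = np.abs(np.corrcoef(X, rowvar=False))
    # Absolute correlation of each feature with the response.
    resp_corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )

    keep = np.ones(n_features, dtype=bool)
    for i in range(n_features):
        for j in range(i + 1, n_features):
            # Assumption: pairs involving an already removed feature are skipped.
            if keep[i] and keep[j] and feat_corr[i, j] > threshold:
                # Assumption: on a tie, the second feature of the pair is dropped.
                drop = i if resp_corr[i] < resp_corr[j] else j
                keep[drop] = False
    return np.flatnonzero(keep)
```

Calling `correlation_filter(X, y, t)` for each `t` in `THRESHOLDS` would yield one feature subset per threshold, each of which could then be passed to a classifier. Because the comparison is strict, a threshold of 1 removes nothing and serves as the baseline with all features retained.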