Clustering the Feature Space
IENCO, Dino; MEO, Rosa
2008-01-01
Abstract
In this paper we propose and test the use of hierarchical clustering for feature selection in databases. The clustering method is Ward's, with a distance measure based on the Goodman-Kruskal tau. We motivate the choice of this measure and compare it with alternatives. Our hierarchical clustering is applied to over 40 data sets from the UCI archive. The proposed approach is interesting from several viewpoints. First, it produces a dendrogram of feature subsets, which serves as a valuable tool for studying relevance relationships among features. Second, the dendrogram is used in a feature selection algorithm to select the best features by a wrapper method. Experiments were run with three different families of classifiers: Naive Bayes, decision trees, and k-nearest neighbours. With our method, all three classifiers generally outperform their counterparts without feature selection. We also compare our feature selection with other state-of-the-art methods, obtaining on average better classification accuracy, though a lower reduction in the number of features. Moreover, unlike other approaches to feature selection, our method does not require any parameter tuning.
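To make the kind of computation described above concrete, here is a minimal Python sketch (not the authors' implementation): the Goodman-Kruskal tau between pairs of categorical features, symmetrised into a dissimilarity, then fed to SciPy's Ward linkage to build a feature dendrogram. The symmetrisation `1 - (tau_ij + tau_ji)/2`, the toy data, and all function names are our assumptions for illustration; note also that SciPy's Ward linkage formally assumes Euclidean distances, whereas a tau-based dissimilarity is not Euclidean in general.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def gk_tau(x, y):
    """Goodman-Kruskal tau(Y|X): proportional reduction in the Gini
    prediction error of categorical y when categorical x is known."""
    ys = np.unique(y)
    py = np.array([(y == v).mean() for v in ys])
    v_y = 1.0 - np.sum(py ** 2)                 # unconditional Gini error of y
    if v_y == 0.0:
        return 0.0                              # y is constant: no error to reduce
    v_y_given_x = 0.0
    for vx in np.unique(x):
        mask = x == vx
        pyx = np.array([(y[mask] == v).mean() for v in ys])
        v_y_given_x += mask.mean() * (1.0 - np.sum(pyx ** 2))
    return (v_y - v_y_given_x) / v_y

def tau_distance(X):
    """Symmetrised dissimilarity between feature columns of X:
    d(i, j) = 1 - (tau(j|i) + tau(i|j)) / 2  (an illustrative choice)."""
    m = X.shape[1]
    d = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            t = (gk_tau(X[:, i], X[:, j]) + gk_tau(X[:, j], X[:, i])) / 2.0
            d[i, j] = d[j, i] = 1.0 - t
    return d

# Toy categorical data: feature 1 duplicates feature 0,
# feature 2 is a noisy copy of feature 0, feature 3 is independent noise.
rng = np.random.default_rng(0)
f0 = rng.integers(0, 3, 200)
X = np.column_stack([f0,
                     f0,
                     (f0 + rng.integers(0, 2, 200)) % 3,
                     rng.integers(0, 3, 200)])

# Ward linkage over the condensed tau-based distance matrix,
# then cut the feature dendrogram into two groups.
Z = linkage(squareform(tau_distance(X)), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

The dendrogram `Z` plays the role of the feature-subsets tree described in the abstract: each merge groups mutually predictive features, and a wrapper method can then evaluate candidate cuts of the tree with a classifier of choice.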