Clustering the Feature Space

IENCO, Dino;MEO, Rosa
2008-01-01

Abstract

In this paper we propose and test the use of hierarchical clustering for feature selection in databases. The clustering method is Ward's, with a distance measure based on the Goodman-Kruskal Tau. We motivate the choice of this measure and compare it with alternatives. We apply our hierarchical clustering to over 40 data sets from the UCI archive. The proposed approach is interesting from several viewpoints. First, it produces a dendrogram of feature subsets, which serves as a valuable tool for studying relevance relationships among features. Second, the dendrogram is used in a feature selection algorithm that selects the best features by a wrapper method. Experiments were run with three families of classifiers: Naive Bayes, decision trees and k-nearest neighbours. With our method, all three classifiers generally outperform their counterparts trained without feature selection. We also compare our feature selection with other state-of-the-art methods, obtaining on average better classification accuracy, though a smaller reduction in the number of features. Moreover, unlike other approaches to feature selection, our method does not require any parameter tuning.
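
To illustrate the clustering step described in the abstract, the following minimal Python sketch computes the Goodman-Kruskal tau between pairs of categorical features, turns it into a distance, and merges features with Ward's criterion. Several choices are assumptions made for illustration and are not taken from the paper: the symmetrised distance 1 - (tau(y|x) + tau(x|y)) / 2, the convention tau = 0 when the predicted feature is constant, the application of the Lance-Williams form of Ward's update to a distance that need not be Euclidean, and all function names and toy data.

```python
from collections import Counter


def gk_tau(x, y):
    """Goodman-Kruskal tau(y | x): proportional reduction in the error of
    predicting y once x is known (0 = no help, 1 = perfect prediction)."""
    n = len(x)
    joint = Counter(zip(x, y))
    x_counts = Counter(x)
    y_counts = Counter(y)
    base = sum((c / n) ** 2 for c in y_counts.values())
    if base == 1.0:  # y is constant; returning 0 is a convention assumed here
        return 0.0
    cond = sum(c * c / (x_counts[xv] * n) for (xv, _), c in joint.items())
    return (cond - base) / (1.0 - base)


def tau_distance(x, y):
    """Symmetrised distance (an assumption, not the paper's definition)."""
    return 1.0 - 0.5 * (gk_tau(x, y) + gk_tau(y, x))


def ward_merges(dist):
    """Agglomerate items with Ward's criterion via the Lance-Williams update
    on squared distances. Returns the merges in order, each as a pair of
    tuples holding the original item indices of the two merged clusters."""
    n = len(dist)
    clusters = {i: (i,) for i in range(n)}
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        a, b = min(d2, key=d2.get)  # closest pair of current clusters
        d_ab = d2.pop((a, b))
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for k in clusters:
            if k in (a, b):
                continue
            nk = len(clusters[k])
            d_ak = d2.pop((min(a, k), max(a, k)))
            d_bk = d2.pop((min(b, k), max(b, k)))
            updated[(k, next_id)] = (
                (na + nk) * d_ak + (nb + nk) * d_bk - nk * d_ab
            ) / (na + nb + nk)
        d2.update(updated)
        merges.append((clusters[a], clusters[b]))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy data: f1 is a relabelling of f0, while f2 is independent of both.
features = [
    ["r", "r", "g", "g", "b", "b"],  # f0
    ["R", "R", "G", "G", "B", "B"],  # f1
    ["u", "v", "u", "v", "u", "v"],  # f2
]
dist = [[tau_distance(p, q) for q in features] for p in features]
merges = ward_merges(dist)
```

On this toy data the two mutually predictive features are merged first and the independent one joins last, which is the kind of relevance structure a feature dendrogram is meant to expose.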
2008
Sixteenth Italian Symposium on Advanced Database Systems
Palermo, Italy
June 2008
Proceedings of the Sixteenth Italian Symposium on Advanced Database Systems
SEBD
pp. 374-381
http://sebd.org/2008/index.htm
Goodman-Kruskal Tau; Ward's hierarchical clustering; feature selection; wrapper
Ienco, Dino; Meo, Rosa

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/49974