Feature selection is an important task in data mining because it allows to reduce the data dimensionality and eliminates the noisy variables. Traditionally, feature selection has been applied in supervised scenarios rather than in unsupervised ones. Nowadays, the amount of unsupervised data available on the web is huge, thus motivating an increasing interest in feature selection for unsupervised data. In this paper we present some results in the domain of document categorization. We use the well-known PageRank algorithm to perform a random-walk through the feature space of the documents. This allows to rank and subsequently choose those features that better represent the data set. When compared with previous work based on information gain, our method allows classifiers to obtain good accuracy especially when few features are retained.
Using PageRank in Feature Selection
IENCO, Dino;MEO, Rosa;BOTTA, Marco
2008-01-01
Abstract
Feature selection is an important task in data mining because it allows to reduce the data dimensionality and eliminates the noisy variables. Traditionally, feature selection has been applied in supervised scenarios rather than in unsupervised ones. Nowadays, the amount of unsupervised data available on the web is huge, thus motivating an increasing interest in feature selection for unsupervised data. In this paper we present some results in the domain of document categorization. We use the well-known PageRank algorithm to perform a random-walk through the feature space of the documents. This allows to rank and subsequently choose those features that better represent the data set. When compared with previous work based on information gain, our method allows classifiers to obtain good accuracy especially when few features are retained.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.