Clustering validation is one of the most important and challenging parts of clustering analysis, as there is no ground truth knowledge to compare the results with. Up till now, the evaluation methods for clustering algorithms have been used for determining the optimal number of clusters in the data, assessing the quality of clustering results through various validity criteria, comparison of results with other clustering schemes, etc. It is also often practically important to build a model on a large amount of training data and then apply the model repeatedly to smaller amounts of new data. This is similar to assign- ing new data points to existing clusters which are constructed on the training set. However, very little practical guidance is available to measure the prediction strength of the constructed model to predict cluster labels for new samples. In this study, we proposed an extension of the cross-validation procedure to evaluate the quality of the clustering model in predict- ing cluster membership for new data points. The performance score was measured in terms of the root mean squared error based on the information from multiple labels of the training and testing samples. The principal component analysis (PCA) followed by k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested using three benchmark multi-label datasets and has shown promising results with overall RMSE of less than 0.075 and MAPE of less than 12.5% in three datasets.
Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets
Tarekegn, Adane Nega
First
;Giacobini, MarioLast
2020-01-01
Abstract
Clustering validation is one of the most important and challenging parts of clustering analysis, as there is no ground truth knowledge to compare the results with. Up till now, the evaluation methods for clustering algorithms have been used for determining the optimal number of clusters in the data, assessing the quality of clustering results through various validity criteria, comparison of results with other clustering schemes, etc. It is also often practically important to build a model on a large amount of training data and then apply the model repeatedly to smaller amounts of new data. This is similar to assign- ing new data points to existing clusters which are constructed on the training set. However, very little practical guidance is available to measure the prediction strength of the constructed model to predict cluster labels for new samples. In this study, we proposed an extension of the cross-validation procedure to evaluate the quality of the clustering model in predict- ing cluster membership for new data points. The performance score was measured in terms of the root mean squared error based on the information from multiple labels of the training and testing samples. The principal component analysis (PCA) followed by k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested using three benchmark multi-label datasets and has shown promising results with overall RMSE of less than 0.075 and MAPE of less than 12.5% in three datasets.File | Dimensione | Formato | |
---|---|---|---|
Tarekegn2020_Article_Cross-ValidationApproachToEval_final.pdf
Accesso riservato
Descrizione: Tarekegn2020_SN
Tipo di file:
PDF EDITORIALE
Dimensione
1.09 MB
Formato
Adobe PDF
|
1.09 MB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Tarekegn2020_Article_Cross-ValidationApproachToEval_preprint.pdf
Accesso aperto
Descrizione: Tarekegn2020_SN_OA
Tipo di file:
POSTPRINT (VERSIONE FINALE DELL’AUTORE)
Dimensione
711 kB
Formato
Adobe PDF
|
711 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.