CINECA IRIS Institutional Research Information System

Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets

Tarekegn, Adane Nega; Michalak, Krzysztof; Giacobini, Mario

2020-01-01

Abstract

Clustering validation is one of the most important and challenging parts of clustering analysis, as there is no ground truth to compare the results with. To date, evaluation methods for clustering algorithms have been used to determine the optimal number of clusters in the data, to assess the quality of clustering results through various validity criteria, to compare results with other clustering schemes, and so on. It is also often practically important to build a model on a large amount of training data and then apply the model repeatedly to smaller amounts of new data. This is similar to assigning new data points to existing clusters that were constructed on the training set. However, very little practical guidance is available on how to measure the strength of the constructed model in predicting cluster labels for new samples. In this study, we propose an extension of the cross-validation procedure to evaluate the quality of the clustering model in predicting cluster membership for new data points. The performance score is measured in terms of the root mean squared error (RMSE), based on the information from multiple labels of the training and testing samples. Principal component analysis (PCA) followed by the k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested on three benchmark multi-label datasets and showed promising results, with an overall RMSE of less than 0.075 and a mean absolute percentage error (MAPE) of less than 12.5% on all three.
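The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the score precisely, so the sketch assumes it is the RMSE between the per-cluster label prevalences of the training and testing folds. The function names (`pca_fit`, `kmeans_fit`, `assign`, `cv_cluster_rmse`), the parameter defaults, and the synthetic data are all hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (as rows) via SVD."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def assign(Z, centroids):
    """Index of the nearest centroid for every row of Z."""
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans_fit(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(Z, centroids)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = Z[labels == j].mean(axis=0)
    return centroids

def cv_cluster_rmse(X, Y, k=3, n_components=2, n_folds=5, seed=0):
    """ASSUMED score: RMSE between per-cluster label prevalences of train and test."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    sq_errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # Fit PCA and k-means on the training fold only.
        mean, axes = pca_fit(X[train], n_components)
        z_train = (X[train] - mean) @ axes.T
        z_test = (X[test] - mean) @ axes.T
        centroids = kmeans_fit(z_train, k, seed=seed)
        # Assign held-out points to the clusters built on the training fold.
        c_train, c_test = assign(z_train, centroids), assign(z_test, centroids)
        for j in range(k):
            if np.any(c_train == j) and np.any(c_test == j):
                diff = (Y[train][c_train == j].mean(axis=0)
                        - Y[test][c_test == j].mean(axis=0))
                sq_errors.append(diff ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

# Synthetic stand-in data (hypothetical; not one of the benchmark datasets).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
X[:100] += 4.0
X[100:200] -= 4.0
Y = (rng.random((300, 4)) < 0.3).astype(float)  # binary multi-label matrix
score = cv_cluster_rmse(X, Y)
print(f"cross-validated RMSE: {score:.3f}")
```

Fitting PCA and k-means on the training fold alone, and only then assigning the held-out points to the nearest centroid, is what makes the score a measure of prediction on new data rather than of fit.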


Year: 2020
Journal title: SN COMPUTER SCIENCE
Volume: 1
Issue: 5
Pages (from): 1
Pages (to): 9
DOI: https://dx.doi.org/10.1007/s42979-020-00283-z
Product URL (open-access archives, full text on publisher site, etc.): https://link.springer.com/article/10.1007/s42979-020-00283-z
Keywords: Clustering validation, Clustering analysis, Cross-validation, Multi-label data
All authors: Tarekegn, Adane Nega; Michalak, Krzysztof; Giacobini, Mario
Appears in types: 03A-Journal Article

Files in this product:

File	Access	Description	File type	Size	Format
Tarekegn2020_Article_Cross-ValidationApproachToEval_final.pdf	Restricted access	Tarekegn2020_SN	Publisher's PDF	1.09 MB	Adobe PDF
Tarekegn2020_Article_Cross-ValidationApproachToEval_preprint.pdf	Open access	Tarekegn2020_SN_OA	Postprint (author's final version)	711 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1754706
