CINECA IRIS Institutional Research Information System

Context-Based Distance Learning for Categorical Data Clustering

IENCO, Dino; PENSA, Ruggero Gaetano; MEO, Rosa

2009-01-01

Abstract

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike the values of numerical attributes, the values of a categorical attribute are not ordered, so it is difficult to define a distance between pairs of them. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute Ai can be determined by the way the values of the other attributes Aj are distributed in the dataset objects: if the values of Aj are similarly distributed in the two groups of objects that correspond to the two values of Ai, the distance between those values is low. We also propose a solution to the critical problem of choosing the attributes Aj. We validate our approach on various real-world and synthetic datasets by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive with state-of-the-art categorical data clustering approaches.
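The key intuition in the abstract can be illustrated with a minimal Python sketch. It is not the paper's actual formula, which the abstract does not state. As assumptions, it estimates the conditional distribution of each context attribute Aj within the group of objects that share a value of Ai, and compares two such groups with a normalized Euclidean distance. The function names, the toy dataset, and the choice of using every other attribute as context are all hypothetical; the paper's method for selecting the attributes Aj is not reproduced here.

```python
from collections import Counter
from math import sqrt


def conditional_distribution(rows, target, context_attr, value):
    """Estimate P(context_attr = v | target = value) by counting rows."""
    group = [r[context_attr] for r in rows if r[target] == value]
    total = len(group)
    return {v: c / total for v, c in Counter(group).items()}


def context_distance(rows, target, x, y, context):
    """Illustrative distance between values x and y of attribute `target`.

    Assumption: a normalized Euclidean distance between the conditional
    distributions of the context attributes; not taken from the paper.
    """
    squared_sum = 0.0
    n_terms = 0
    for attr in context:
        p_x = conditional_distribution(rows, target, attr, x)
        p_y = conditional_distribution(rows, target, attr, y)
        # Compare the two distributions over every value the attribute takes.
        for v in sorted({r[attr] for r in rows}):
            squared_sum += (p_x.get(v, 0.0) - p_y.get(v, 0.0)) ** 2
            n_terms += 1
    if n_terms == 0:
        return 0.0  # empty context or empty dataset: nothing to compare
    return sqrt(squared_sum / n_terms)


# Made-up toy dataset: "city" plays the role of Ai, the rest are the Aj.
rows = [
    {"city": "Turin", "sport": "ski", "food": "pasta"},
    {"city": "Turin", "sport": "ski", "food": "pasta"},
    {"city": "Milan", "sport": "ski", "food": "pasta"},
    {"city": "Milan", "sport": "ski", "food": "rice"},
    {"city": "Palermo", "sport": "swim", "food": "fish"},
    {"city": "Palermo", "sport": "swim", "food": "fish"},
]
context = ["sport", "food"]

d_turin_milan = context_distance(rows, "city", "Turin", "Milan", context)
d_turin_palermo = context_distance(rows, "city", "Turin", "Palermo", context)
print(d_turin_milan, d_turin_palermo)
```

In this toy data, Turin and Milan co-occur with similar values of the other attributes, so their distance comes out smaller than the distance between Turin and Palermo. A distance of this kind could then feed a partitional or hierarchical clustering algorithm, as the abstract describes.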

Year: 2009
Event title: 8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon
Event location: Lyon, France
Event date: August 31 - September 2, 2009
Volume title: Advances in Intelligent Data Analysis VIII, 8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon, France, August 31 - September 2, 2009. Proceedings
Publisher: SPRINGER-VERLAG
Volume no.: 5772/2009
First page: 83
Last page: 94
ISBN: 9783642039140
DOI: https://dx.doi.org/10.1007/978-3-642-03915-7_8
Product URL (open-access archives, full text on publisher's site, etc.): http://ida09.liris.cnrs.fr/
All authors: D. Ienco; R. G. Pensa; R. Meo
Appears in types: 04A-Conference paper in volume

Files in this item:

File	Size	Format
ida2009_dilca.pdf (restricted access; file type: preprint, first draft)	189.84 kB	Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/66894
