The availability of data represented with multiple features coming from heterogeneous domains is getting more and more common in real world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for all the features domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less. This means that it does not require either the number of row clusters or the number of column clusters for the given feature spaces. Our approach optimizes the Goodman-Kruskal's tau, a measure for cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend tau to evaluate co-clustering solutions and in particular we apply it in a higher dimensional setting. We propose the algorithm CoStar which optimizes tau by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while it remains computationally efficient.

Parameter-Less Co-Clustering for Star-Structured Heterogeneous Data

IENCO, Dino;PENSA, Ruggero Gaetano;MEO, Rosa
2013-01-01

Abstract

The availability of data represented with multiple features coming from heterogeneous domains is getting more and more common in real world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for all the features domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less. This means that it does not require either the number of row clusters or the number of column clusters for the given feature spaces. Our approach optimizes the Goodman-Kruskal's tau, a measure for cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend tau to evaluate co-clustering solutions and in particular we apply it in a higher dimensional setting. We propose the algorithm CoStar which optimizes tau by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while it remains computationally efficient.
2013
26
2
217
254
http://www.springerlink.com/content/4qr677415h616232/
Co-clustering; Star-structured data; Multi-view data
D. Ienco; C. Robardet; R.G. Pensa; R. Meo
File in questo prodotto:
File Dimensione Formato  
DAMI1481_4aperto_658901.pdf

Accesso aperto

Tipo di file: POSTPRINT (VERSIONE FINALE DELL’AUTORE)
Dimensione 1.23 MB
Formato Adobe PDF
1.23 MB Adobe PDF Visualizza/Apri
dami2012_printed.pdf

Accesso riservato

Tipo di file: PDF EDITORIALE
Dimensione 1.36 MB
Formato Adobe PDF
1.36 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2318/89901
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 50
  • ???jsp.display-item.citation.isi??? 30
social impact