CINECA IRIS Institutional Research Information System

Entropy regularization in probabilistic clustering

Franzolini Beatrice; Rebaudo Giovanni

2024-01-01

Abstract

Bayesian nonparametric mixture models are widely used to cluster observations. However, one major drawback of the approach is that the estimated partition often has unbalanced cluster frequencies, with only a few dominating clusters and a large number of sparsely populated ones. This feature leads to results that are often uninterpretable unless we are willing to ignore a substantial number of observations and clusters. Interpreting the posterior distribution as a penalized likelihood, we show how the imbalance can be explained as a direct consequence of the cost functions involved in estimating the partition. In light of our findings, we propose a novel Bayesian estimator of the clustering configuration. The proposed estimator is equivalent to a post-processing procedure that reduces the number of sparsely populated clusters and enhances interpretability. The procedure takes the form of entropy regularization of the Bayesian estimate. While being computationally convenient compared with alternative strategies, it is also theoretically justified as a correction to the Bayesian loss function used for point estimation and, as such, can be applied to any posterior distribution of clusters, regardless of the specific model used.

Year: 2024
Journal title: STATISTICAL METHODS & APPLICATIONS
Volume: 33
Pages: 37-60
DOI: https://dx.doi.org/10.1007/s10260-023-00716-y
Item URL (open-access archives, full text on the publisher's site, etc.): https://link.springer.com/article/10.1007/s10260-023-00716-y#citeas
Keywords: Dirichlet process, Loss functions, Mixture models, Unbalanced clusters, Random partition
All authors: Franzolini Beatrice; Rebaudo Giovanni
Appears in categories: 03A - Journal article

Files in this item:

File	Size	Format
2023SMAP.pdf (open access; file type: publisher's PDF)	3.55 MB	Adobe PDF
s10260-023-00716-y.pdf (restricted access; file type: postprint, author's final version)	3.58 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1928490
