
Are you sure? Measuring models bias in content moderation through uncertainty

Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci
2025-01-01

Abstract

Automatic content moderation is crucial to ensuring safety in social media. Language Model-based classifiers are increasingly adopted for this task, but it has been shown that they perpetuate racial and social biases. Although several resources and benchmark corpora have been developed to address this issue, measuring the fairness of models in content moderation remains an open problem. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models (LMs and LLMs) against women and non-white annotators, and we observe to what extent it diverges from performance-based metrics such as the F1 score. The results show that some pre-trained models predict the labels provided by minority groups with high accuracy, even though their confidence in these predictions is low. Therefore, by measuring model confidence, we can identify which groups of annotators are better represented in pre-trained models and guide the debiasing process of these models before their actual deployment.
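A minimal sketch of the kind of uncertainty estimation the abstract refers to: split conformal prediction turns a classifier's softmax scores into prediction sets, and the average set size can serve as an uncertainty proxy to compare across annotator groups. The data, function name, and group comparison below are illustrative placeholders and assumptions, not the paper's actual models or experimental setup.

```python
# Sketch of split conformal prediction for a binary content-moderation classifier.
# Assumes softmax probabilities are available for a held-out calibration set and
# for the evaluation set; all data below is randomly generated for illustration.
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Return one boolean prediction set per test example (rows of test_probs).

    cal_probs:  (n_cal, n_classes) softmax scores on the calibration set
    cal_labels: (n_cal,) true labels of the calibration set
    test_probs: (n_test, n_classes) softmax scores on the evaluation set
    alpha:      target miscoverage rate (sets cover the true label ~1 - alpha of the time)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # A class enters the prediction set when its score is within the threshold.
    return (1.0 - test_probs) <= q_hat

# Illustrative usage with placeholder scores: larger mean set sizes indicate
# higher uncertainty; in the paper's spirit, this quantity would be computed
# separately per annotator group and contrasted with performance metrics like F1.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet([2, 2], size=200)   # placeholder calibration softmax scores
cal_labels = rng.integers(0, 2, size=200)     # placeholder calibration labels
test_probs = rng.dirichlet([2, 2], size=100)  # placeholder evaluation softmax scores
pred_sets = conformal_prediction_sets(cal_probs, cal_labels, test_probs)
print("mean prediction-set size:", pred_sets.sum(axis=1).mean())
```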
2025
The 2025 Conference on Empirical Methods in Natural Language Processing
Suzhou, China
4–9 November 2025
Findings of the Association for Computational Linguistics: EMNLP 2025
Association for Computational Linguistics (ACL)
Pages 18061–18076
https://aclanthology.org/2025.findings-emnlp.980/
bias detection, uncertainty, conformal prediction, content moderation
Files in this record:
File: 12_sure.pdf
Access: Open access
File type: Publisher's PDF (PDF EDITORIALE)
Size: 7.95 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2102661