
DisaggregHate It Corpus: A Disaggregated Italian Dataset of Hate Speech

Marco Madeddu, Simona Frenda, Mirko Lai, Viviana Patti, Valerio Basile

Abstract

Recent studies in machine learning advocate exploiting the disagreement between annotators to train models that reflect the different opinions humans hold about a specific phenomenon. This means that datasets whose annotations are aggregated by majority voting are not enough. In this paper, we present an Italian disaggregated dataset on hate speech that also encodes some information about the annotators: the DisaggregHate It Corpus. The corpus contains Italian tweets focused on the topic of racism, annotated by native Italian university students. We explain how the dataset was gathered following the recommendations of the perspectivist approach [1], encouraging the annotators to provide some socio-demographic information about themselves. To exploit disagreement in the learning process, we propose two types of soft labels: softmax and standard normalization. We investigate the benefit of using disagreement by creating a baseline binary model and two regression models trained, respectively, on the 'hard' label (aggregated by majority voting) and on the two types of 'soft' labels. We test the models in in-domain and out-of-domain settings, evaluating their performance using cross-entropy as a metric and showing that the models trained on the soft labels perform better.
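The abstract names two ways of turning disaggregated binary annotations into soft labels, "standard normalization" and "softmax", plus cross-entropy as the evaluation metric. The sketch below is our own illustrative reconstruction of one plausible reading for binary hate/not-hate votes, not the authors' code; all function names are hypothetical.

```python
import math

def hard_label(votes):
    """Majority-vote aggregation: the 'hard' binary label (1 = hateful)."""
    return int(sum(votes) > len(votes) / 2)

def soft_label_norm(votes):
    """Standard normalization: fraction of annotators who voted 'hateful'."""
    return sum(votes) / len(votes)

def soft_label_softmax(votes):
    """Softmax over the per-class vote counts; returns P(hateful)."""
    hate = sum(votes)
    not_hate = len(votes) - hate
    e_hate, e_not = math.exp(hate), math.exp(not_hate)
    return e_hate / (e_hate + e_not)

def cross_entropy(p, q, eps=1e-12):
    """Binary cross-entropy between a target probability p and a prediction q."""
    q = min(max(q, eps), 1 - eps)
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))
```

For example, with votes `[1, 1, 0]` the hard label is 1, the normalized soft label is 2/3, and the softmax soft label is somewhat sharper (about 0.73), illustrating how the two soft-label schemes distribute the annotators' disagreement differently.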
Year: 2023
Conference: CLiC-it 2023 Italian Conference on Computational Linguistics
Venue: Venezia
Date: December 2023
Proceedings: Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)
Editors: Federico Boschetti, Gianluca E. Lebani, Bernardo Magnini, Nicole Novielli
Volume: 3596
Pages: 1-8
URL: https://ceur-ws.org/Vol-3596/paper29.pdf
Keywords: hate speech, perspectivism, disagreement
Authors: Marco Madeddu, Simona Frenda, Mirko Lai, Viviana Patti, Valerio Basile

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1950454