Swearing plays an ubiquitous role in everyday conversations among humans, both in oral and textual communication, and occurs frequently in social media texts, typically featured by informal language and spontaneous writing. Such occurrences can be linked to an abusive context, when they contribute to the expression of hatred and to the abusive effect, causing harm and offense. However, swearing is multifaceted and is often used in casual contexts, also with positive social functions. In this study, we explore the phenomenon of swearing in Twitter conversations, taking the possibility of predicting the abusiveness of a swear word in a tweet context as the main investigation perspective. We developed the Twitter English corpus SWAD (Swear Words Abusiveness Dataset), where abusive swearing is manually annotated at the word level. Our collection consists of 1,511 unique swear words from 1,320 tweets. We developed models to automatically predict abusive swearing, to provide an intrinsic evaluation of SWAD and confirm the robustness of the resource. We also present the results of a glass box ablation study in order to investigate which lexical, syntactic, and affective features are more informative towards the automatic prediction of the function of swearing.

Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media

Endang Wahyu Pamungkas
;
Valerio Basile;Viviana Patti
2020-01-01

Abstract

Swearing plays an ubiquitous role in everyday conversations among humans, both in oral and textual communication, and occurs frequently in social media texts, typically featured by informal language and spontaneous writing. Such occurrences can be linked to an abusive context, when they contribute to the expression of hatred and to the abusive effect, causing harm and offense. However, swearing is multifaceted and is often used in casual contexts, also with positive social functions. In this study, we explore the phenomenon of swearing in Twitter conversations, taking the possibility of predicting the abusiveness of a swear word in a tweet context as the main investigation perspective. We developed the Twitter English corpus SWAD (Swear Words Abusiveness Dataset), where abusive swearing is manually annotated at the word level. Our collection consists of 1,511 unique swear words from 1,320 tweets. We developed models to automatically predict abusive swearing, to provide an intrinsic evaluation of SWAD and confirm the robustness of the resource. We also present the results of a glass box ablation study in order to investigate which lexical, syntactic, and affective features are more informative towards the automatic prediction of the function of swearing.
2020
The 12th Language Resources and Evaluation Conference
Marseille, France
May, 2020
Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020)
European Language Resources Association
6237
6246
https://www.aclweb.org/anthology/2020.lrec-1.765
swearing, social media, abusive language detection
Endang Wahyu Pamungkas, Valerio Basile, Viviana Patti
File in questo prodotto:
File Dimensione Formato  
2020.lrec-1.765.pdf

Accesso aperto

Tipo di file: PDF EDITORIALE
Dimensione 1.8 MB
Formato Adobe PDF
1.8 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2318/1757919
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 27
  • ???jsp.display-item.citation.isi??? 19
social impact