Cross-domain and Cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon

Pamungkas E.; Patti V.
2019-01-01

Abstract

The development of computational methods to detect abusive language in social media within variable and multilingual contexts has recently gained significant traction. This growing interest is confirmed by the large number of benchmark corpora developed for different languages in recent years. However, abusive language behaviour is multifaceted, and the available datasets have different topical focuses. This makes abusive language detection a domain-dependent task, and building a robust system to detect general abusive content is a first challenge. Moreover, most resources are available for English, which makes detecting abusive language in low-resource languages a further challenge. We address both challenges by considering ten publicly available datasets across different domains and languages. We propose a hybrid approach that combines deep learning with a multilingual lexicon for cross-domain and cross-lingual detection of abusive content, and compare it with simpler models. We show that training a system on general abusive language datasets produces a robust cross-domain system, which can be used to detect other, more specific types of abusive content. We also find that the domain-independent lexicon HurtLex is useful for transferring knowledge between domains and languages. In the cross-lingual experiment, we demonstrate the effectiveness of our joint learning model in out-of-domain scenarios as well.
2019
57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Student Research Workshop, SRW 2019
Florence, Italy
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Association for Computational Linguistics (ACL)
pp. 363-370
ISBN 9781950737475
https://aclanthology.org/P19-2051/
Keywords: abusive language detection, multilinguality, social media, hate lexicons, deep learning, cross-domain experiments
Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1757917