Hierarchical Clustering of Label-based Annotator Representations for Mining Perspectives

Lo, Soda Marem; Basile, V.

Modeling annotator perspectives has emerged as a way to capture subjective linguistic phenomena more accurately. The NLP community has approached this issue by creating perspective-aware and personalized models, which require demographic data or previous annotations. In this paper, we explore two methodologies to represent annotators solely on the basis of the labels they assigned: label agreement and Kernel PCA (KPCA). For these two techniques, we computed 5 and 4 clusters, respectively, trained a perspective-aware model on each cluster, and finally combined the models in majority-vote ensembles. The results show that the clusters obtained with label agreement are more balanced and more homogeneous in terms of annotators' demographic traits, while those obtained with KPCA tend to correlate more with annotators' nationalities. Despite these differences, both ensemble models outperform the baseline, confirming that leveraging annotations through clustering techniques is advantageous for classifying a subjective phenomenon such as irony. We argue that this approach can be beneficial for taking annotators' perspectives into account when demographic data are not available, and that it allows for the possibility that annotations are influenced by factors other than the given demographics.
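
To make the label-only representation more concrete, the following minimal Python sketch illustrates the general idea on toy data. It is not the implementation used in the paper. The helper names (`agreement`, `cluster_annotators`, `majority_vote`), the distance (one minus the fraction of shared items with identical labels), the average-linkage criterion, and the toy annotations are all assumptions made for illustration. The KPCA representation and the training of the per-cluster models are omitted.

```python
from collections import Counter
from itertools import combinations


def agreement(a, b):
    """Fraction of items labelled by both annotators on which they agree."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    return sum(a[item] == b[item] for item in shared) / len(shared)


def cluster_annotators(labels, n_clusters):
    """Agglomerative clustering with average linkage on (1 - agreement).

    `labels` maps annotator id -> {item id: label}. Both the distance and
    the linkage criterion are illustrative assumptions.
    """
    names = sorted(labels)
    dist = {
        frozenset((p, q)): 1.0 - agreement(labels[p], labels[q])
        for p, q in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def majority_vote(predictions):
    """Most frequent label among the predictions of the per-cluster models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy annotations: annotator -> {item id: binary irony label}.
toy = {
    "a1": {1: 1, 2: 1, 3: 0, 4: 0},
    "a2": {1: 1, 2: 1, 3: 0, 4: 1},
    "a3": {1: 0, 2: 0, 3: 1, 4: 1},
    "a4": {1: 0, 2: 0, 3: 1, 4: 0},
}
clusters = cluster_annotators(toy, n_clusters=2)
```

In this toy setting, annotators who mostly assign the same labels end up in the same cluster. In a full pipeline, one model would be trained per cluster, and `majority_vote` would combine their predictions for each text.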