Zero-Shot Topic Labeling for Hazard Classification

Basile, Valerio
2022-01-01

Abstract

Topic classification is the task of mapping text onto a set of meaningful labels that are known beforehand. The scenario is common in both academia and industry whenever a large corpus of documents must be categorized according to a set of custom labels. The standard supervised approach, however, requires thousands of manually labelled documents, plus additional effort every time the label taxonomy changes. To avoid these drawbacks, we investigated a zero-shot approach to topic classification. In this setting, some or all of the topics are not seen at training time, so the model must classify the corresponding examples using additional information. First, we show how zero-shot classification can perform the topic-classification task without any supervision. Second, we build a novel hazard-detection dataset by manually selecting tweets gathered by LINKS Foundation for this task, and use it to demonstrate the effectiveness of our cost-free method on a real-world problem. The idea is to leverage a pre-trained text embedder (MPNet) to map both texts and topics into the same semantic vector space, where they can be compared. We demonstrate that these semantic spaces are better aligned when their dimension is reduced so that only the most useful information is kept. We investigated three dimensionality reduction techniques: linear projection, autoencoding, and PCA. Using the macro F1-score as the evaluation metric, we found that PCA is the best-performing technique, improving over the baseline on every dataset.
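The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that texts and topic labels have already been embedded into a shared vector space, and it assigns each text the label with the highest cosine similarity before scoring the result with macro F1. The three-dimensional vectors, the label names, and the helper functions `cosine`, `zero_shot_label`, and `macro_f1` are all hypothetical and made up for illustration. The MPNet embedding and the dimensionality reduction step are deliberately left out, and cosine similarity is an assumed choice of comparison, so this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_label(doc_vec, label_vecs):
    """Return the label whose embedding is most similar to the document."""
    return max(label_vecs, key=lambda name: cosine(doc_vec, label_vecs[name]))


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores over the gold labels."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy vectors standing in for real text and label embeddings (assumption).
label_vecs = {
    "flood": [1.0, 0.1, 0.0],
    "wildfire": [0.0, 1.0, 0.1],
    "earthquake": [0.1, 0.0, 1.0],
}
doc_vecs = [
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3],
    [0.2, 0.1, 0.7],
    [0.6, 0.7, 0.0],  # ambiguous document, gold label is "flood"
]
gold = ["flood", "wildfire", "earthquake", "flood"]

# No labelled training data is used: labels are assigned by similarity only.
pred = [zero_shot_label(d, label_vecs) for d in doc_vecs]
score = macro_f1(gold, pred)
print(pred, round(score, 3))
```

Because macro F1 averages the per-label scores without weighting by frequency, a mistake on a rare label costs as much as one on a frequent label, which matters when hazard classes are unevenly represented.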
Year: 2022
Volume: 13
Issue: 10
First page: 1
Last page: 12
https://www.mdpi.com/2078-2489/13/10/444
Rondinelli, Andrea ; Bongiovanni, Lorenzo ; Basile, Valerio
Files in this item:
information-13-00444-v2.pdf (open access; file type: publisher's PDF; format: Adobe PDF; size: 294.81 kB)

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1878899