
Evaluating large language models on Italian tasks

Bernardo Magnini; Marco Madeddu; Viviana Patti
2025-01-01

Abstract

The rapid advancement of Large Language Models (LLMs) has highlighted the need for robust tools to evaluate them reliably. A major challenge in developing models that serve non-English speakers lies in the predominance of benchmarks that are either in English or machine-translated from it. Evaluating the performance of multilingual or language-specific models requires native-language resources. In this paper, we present EVALITA-LLM, a benchmark entirely composed of datasets in native Italian and designed to assess the capabilities of LLMs. The benchmark consists of 10 tasks that cover key aspects of NLP. We also provide prompts for all tasks, designed according to specific criteria. To mitigate prompt sensitivity, the evaluation of the models considers different methodologies for combining the scores obtained on different prompts.
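To illustrate the abstract's last point, the following minimal Python sketch shows one way per-prompt scores could be combined into a single task score. The aggregation strategies ("mean", "best", "worst") and all names here are illustrative assumptions, not the methodology actually defined in the EVALITA-LLM paper.

    from statistics import mean

    def aggregate_prompt_scores(scores, method="mean"):
        # `scores` holds one metric value (e.g., accuracy) per prompt template.
        # NOTE: these aggregation strategies are hypothetical examples,
        # not the procedure used in the EVALITA-LLM paper.
        if method == "mean":    # average over all prompts
            return mean(scores)
        if method == "best":    # score of the best-performing prompt
            return max(scores)
        if method == "worst":   # most pessimistic estimate
            return min(scores)
        raise ValueError(f"unknown aggregation method: {method}")

    # Example: one model's accuracy on one task under five different prompts
    per_prompt = [0.71, 0.64, 0.69, 0.73, 0.66]
    print(aggregate_prompt_scores(per_prompt, "mean"))   # 0.686
    print(aggregate_prompt_scores(per_prompt, "best"))   # 0.73

Averaging rewards models that are robust across prompt formulations, while the best-prompt score measures peak capability; reporting more than one aggregate helps separate prompt sensitivity from task ability.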
Year: 2025
Venue: Thematic Workshops at Ital-IA 2025, Trieste, Italy, June 23-24, 2025
Published in: Joint Proceedings of the Thematic Workshops at Ital-IA 2025, co-located with the 5th National Conference on Artificial Intelligence, organized by CINI (Ital-IA 2025)
Series: CEUR Workshop Proceedings, Vol. 4121
Pages: 1-6
URL: https://ceur-ws.org/Vol-4121/Ital-IA_2025_paper_112.pdf
Keywords: Benchmark, Italian, Evaluation, Large Language Models
Authors: Bernardo Magnini, Roberto Zanoli, Michele Resta, Martin Cimmino, Paolo Albano, Marco Madeddu, Viviana Patti
Files in this product:
Ital-IA_2025_paper_112.pdf (publisher's PDF, open access, Adobe PDF, 950.75 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2121312
Citations
  • PMC: ND
  • Scopus: ND
  • Web of Science: ND