A Leaderboard for Benchmarking LLMs on Italian

Bernardo Magnini, Marco Madeddu, Viviana Patti
2025-01-01

Abstract

We present Evalita-LLM, a comprehensive benchmark and leaderboard designed to evaluate Large Language Models (LLMs) on Italian tasks. Evalita-LLM covers ten native Italian tasks, including both multiple-choice and generative formats, and enables fair and transparent comparisons by using multiple prompts per task, addressing LLMs’ sensitivity to prompt phrasing. The leaderboard supports both zero-shot and few-shot evaluation settings and currently reports results for 23 open-source models. Our findings show consistent performance improvements with few-shot prompting and larger model sizes. Additionally, more recent versions of LLMs generally outperform their predecessors. However, no single model excels across all tasks, which highlights the task-dependent nature of LLM performance. Notably, generative tasks remain significantly more challenging than multiple-choice ones. Hosted on Hugging Face, the Evalita-LLM leaderboard offers a public and continuously updated platform for benchmarking and transparent evaluation of LLMs.
Year: 2025
Conference: Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy, September 24–26, 2025
Published in: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), CEUR Workshop Proceedings, Vol. 4112, pp. 636–646
ISBN: 979-12-243-0587-3
URL: https://aclanthology.org/2025.clicit-1.61/
Keywords: LLMs, Benchmarking, Leaderboard
Authors: Bernardo Magnini, Marco Madeddu, Michele Resta, Roberto Zanoli, Martin Cimmino, Paolo Albano, Viviana Patti
Files in this item:

2025.clicit-1.61.pdf
Open access
File type: publisher's PDF
Size: 1.12 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2121310