CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian

Attanasio G.; Basile P.; Borazio F.; Croce D.; Francis M.; Gili J.; Musacchio E.; Nissim M.; Patti V.; Rinaldi M.; Scalena D.

2024-01-01

Abstract

The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.

Year	2024
Event title	10th Italian Conference on Computational Linguistics, CLiC-it 2024
Event location	Pisa, Italy
Event date	2024
Volume title	Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4-6, 2024
Publisher	CEUR-WS
Volume no.	3878
Pages (from)	1
Pages (to)	10
Product URL (open-access archives, full text on publisher's site, etc.)	https://ceur-ws.org/Vol-3878/116_calamita_preface_long.pdf
Keywords	Italian Benchmark; Language Models; Shared Task
All authors	Attanasio G.; Basile P.; Borazio F.; Croce D.; Francis M.; Gili J.; Musacchio E.; Nissim M.; Patti V.; Rinaldi M.; Scalena D.
Appears in types:	04A-Conference paper in volume

Files in this item:

File	Size	Format
116_calamita_preface_long (2).pdf (open access; file type: publisher's PDF)	430.75 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2059272
