CINECA IRIS Institutional Research Information System

TestiMole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996–2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi;Rossella Varvara;Viviana Patti

2026-01-01

Abstract

We present TestiMole-Conversational, a massive collection of discussion-board messages in Italian. The large size of the corpus, almost 30B word-tokens (1996–2024), poses challenges for processing and curating the resource, but it also makes the corpus an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion-board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication.

Year: 2026
Event title: The 12th Workshop on Challenges in the Management of Large Corpora (CMLC-12) @ LREC 2026
Event location: Palma, Mallorca, Spain
Event date: May 11, 2026
Volume title: Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora (CMLC-12) @ LREC 2026
Publisher: ELRA Language Resources Association
Pages (from): 1
Pages (to): 11
ISBN: 978-2-493814-67-8
Product URL (open-access archives, full text on publisher site, etc.): http://lrec-conf.org/proceedings/lrec2026/workshops/cmlc/2026.cmlc-1.0.pdf
Keywords: Italian language corpus, pre-training data, discussion forums, diachronic corpus
All authors: Matteo Rinaldi, Rossella Varvara, Viviana Patti
Appears in types: 04A-Conference paper in volume

Files in this item:

File	Size	Format
2026.cmlc-1.0_testimole.pdf (open access; file type: publisher's PDF)	2.27 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2139730
