TestiMole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996–2024) for Language Modeling and Sociolinguistic Research
Matteo Rinaldi; Rossella Varvara; Viviana Patti
2026-01-01
Abstract
We present TestiMole-Conversational, a massive collection of discussion board messages in the Italian language. The large size of the corpus, almost 30B word-tokens (1996–2024), brings challenges in the processing and curation of the resource, but it also makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication.

| File | Size | Format |
|---|---|---|
| 2026.cmlc-1.0_testimole.pdf (open access; file type: publisher's PDF) | 2.27 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



