
Large Language Models as Coders of Pragmatic Competence in Healthy Aging: Preliminary Results on Reliability, Limits, and Implications for Human-Centered AI

Boldi, Arianna; Gabbatore, Ilaria; Bosco, Francesca M.
2025-01-01

Abstract

Pragmatics concerns how people use language and other expressive means, such as nonverbal and paralinguistic cues, to convey intended meaning in context. Difficulties in pragmatics are common across distinct clinical conditions, motivating validated assessments such as the Assessment Battery for Communication (ABaCo); whether Large Language Models (LLMs) can serve as reliable coders remains uncertain. In this exploratory study, we used Generative Pre-trained Transformer (GPT)-4o as a rater on 2025 item × dimension units drawn from the responses of 10 healthy older adults (mean age = 69.8 years) to selected ABaCo items. Expert human coders served as the reference standard against which GPT-4o's scores were compared. Agreement metrics included exact agreement, Cohen's κ, and a discrepancy audit by pragmatic act. Agreement was 89.1%, with κ = 0.491. Errors were non-random across acts (χ²(12) = 69.4, p < 0.001). After Benjamini–Hochberg False Discovery Rate correction across 26 cells, only two categories remained significant: false positives concentrated in Command and false negatives in Deceit. Missing prosodic and gestural cues likely exacerbate command-specific failures. In conclusion, in text-only settings, GPT-4o can serve as a second coder, under human oversight, for assessments of pragmatic competence in healthy aging. Safe clinical deployment requires population-specific validation and multimodal inputs that recover nonverbal cues.
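The agreement statistics named in the abstract (exact agreement, Cohen's κ, and Benjamini–Hochberg FDR correction) can be illustrated with a minimal sketch. This is not the authors' code; the binary codings and p-values below are invented purely for illustration.

```python
# Sketch of the agreement statistics mentioned in the abstract.
# All data here are illustrative, not the study's data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same units."""
    n = len(rater_a)
    # Observed agreement: proportion of units coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two raters' marginal distributions.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

def benjamini_hochberg(p_values, q=0.05):
    """Return True where H0 is rejected at FDR level q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value sits under the BH line (rank/m)*q.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

# Hypothetical binary codings (1 = pass, 0 = fail) for eight units.
human = [1, 1, 0, 1, 0, 1, 1, 0]
model = [1, 1, 0, 0, 0, 1, 1, 1]
exact = sum(h == g for h, g in zip(human, model)) / len(human)  # 0.75
kappa = cohens_kappa(human, model)
```

Note that κ is always lower than raw agreement whenever chance agreement is nonzero, which is why the study can report 89.1% exact agreement alongside a moderate κ of 0.491.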
Electronics, 2025, 14(22), pp. 1–16
https://www.mdpi.com/2079-9292/14/22/4411
Keywords: pragmatic assessment; large language models; human–AI collaboration; digital health; healthy aging
File: 2025_Boldi et al. ABaCo&ChatGPT_Electronics.pdf (open access; publisher's PDF; 309.24 kB)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2105995