
Large Language Models as Coders of Pragmatic Competence in Healthy Aging: Preliminary Results on Reliability, Limits, and Implications for Human-Centered AI

Boldi, Arianna; Gabbatore, Ilaria; Bosco, Francesca M.
2025-01-01

Abstract

Pragmatics concerns how people use language and other expressive means, such as nonverbal and paralinguistic cues, to convey intended meaning in context. Difficulties in pragmatics are common across distinct clinical conditions, motivating validated assessments such as the Assessment Battery for Communication (ABaCo); whether Large Language Models (LLMs) can serve as reliable coders remains uncertain. In this exploratory study, we used Generative Pre-trained Transformer (GPT)-4o as a rater on 2025 item × dimension units drawn from the responses of 10 healthy older adults (mean age = 69.8 years) to selected ABaCo items. Expert human coders served as the reference standard against which GPT-4o's scores were compared. Agreement metrics included exact agreement, Cohen's κ, and a discrepancy audit by pragmatic act. Agreement was 89.1%, with κ = 0.491. Errors were non-random across acts (χ²(12) = 69.4, p < 0.001). After Benjamini–Hochberg False Discovery Rate correction across 26 cells, only two categories remained significant: false positives concentrated in Command and false negatives in Deceit. Missing prosodic and gestural cues likely exacerbate command-specific failures. In conclusion, in text-only settings, GPT-4o can serve as a second coder, under human oversight, for assessments of pragmatic competence in healthy aging. Safe clinical deployment requires population-specific validation and multimodal inputs that recover nonverbal cues.
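The agreement statistics named in the abstract (exact agreement, Cohen's κ, and Benjamini–Hochberg FDR correction) can be illustrated with a minimal sketch. This is not the authors' code; the binary codings and p-values below are invented purely for illustration.

```python
# Sketch of the agreement statistics mentioned in the abstract.
# All data here are illustrative, not the study's data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same units."""
    n = len(rater_a)
    # Observed agreement: proportion of units coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two raters' marginal distributions.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

def benjamini_hochberg(p_values, q=0.05):
    """Return True where H0 is rejected at FDR level q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value sits under the BH line (rank/m)*q.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

# Hypothetical binary codings (1 = pass, 0 = fail) for eight units.
human = [1, 1, 0, 1, 0, 1, 1, 0]
model = [1, 1, 0, 0, 0, 1, 1, 1]
exact = sum(h == g for h, g in zip(human, model)) / len(human)  # 0.75
kappa = cohens_kappa(human, model)
```

Note that κ is always lower than raw agreement whenever chance agreement is nonzero, which is why the study can report 89.1% exact agreement alongside a moderate κ of 0.491.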
Electronics, 2025, 14(22), pp. 1–16
https://www.mdpi.com/2079-9292/14/22/4411
Keywords: pragmatic assessment; large language models; human–AI collaboration; digital health; healthy aging
File: 2025_Boldi et al. ABaCo&ChatGPT_Electronics.pdf (open access; publisher's PDF; 309.24 kB)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2105995