Potential for AI as first reader in lung cancer screening

Ledda, Roberta Eufrasia; Valsecchi, Camilla; Sabia, Federica; Milanese, Gianluca; Balbi, Maurizio; Rolli, Luigi; Ruggirello, Margherita; Sverzellati, Nicola; Marchianò, Alfonso Vittorio; Pastorino, Ugo

doi:10.1016/j.ejrad.2025.112561

Purpose: To retrospectively assess the agreement between human and automated AI-based readings of low-dose computed tomography (LDCT) scans, categorized according to Lung-RADS v1.1, in lung cancer screening (LCS), and to test the diagnostic performance of both readings. Methods: We included 4104 baseline LDCTs from the BioMILD trial. The original human readings were retrospectively classified by a radiologist as "negative" (Lung-RADS v1.1 categories 1, 2) or "positive" (categories 3, 4), and the LDCTs were analyzed by AI software for category assignment. Diagnosis of lung cancer (LC) at 2 years served as the reference standard to assess sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV) of both human and AI readings. Agreement between readers was measured by Cohen's kappa with Fleiss-Cohen weights (Kw) and its 95 % CI. Results: Median age of participants was 60 years; 60.8 % were male and 79.2 % were current smokers; 68/4104 (1.7 %) were diagnosed with LC. Of these, 6/68 (8.8 %) and 7/68 (10.3 %) LDCTs were classified as negative by AI and human reading, respectively. The agreement between human and AI readings for negative and positive LDCTs was 83.5 % (Kw 0.47; 95 % CI: 0.43-0.50). Sensitivity and specificity were 91.2 % and 75.7 % for AI, and 89.7 % and 90.0 % for human reading (p = 0.5637 and p < 0.0001, respectively). PPV and NPV were 6.0 % and 99.8 % for AI, and 13.1 % and 99.8 % for human reading (p < 0.0001 and p = 0.9351, respectively). The expected reduction in LDCT reading workload when using AI as first reader was 74.7 %. Conclusion: AI reading showed comparable sensitivity but lower specificity than human reading. The high NPV of AI may support its use as a first reader in LCS.
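As an illustration of how the four diagnostic metrics named in the abstract follow from a 2x2 table against the reference standard, a minimal Python sketch is given below. The function names are hypothetical and are not taken from the study. Only the counts stated in the abstract (68 lung cancers, of which 6 were read as negative by AI and 7 by human reading) are used in the worked example. The abstract does not report the false-positive and true-negative counts, so specificity, PPV, and NPV are defined but not evaluated here.

```python
# Minimal sketch (assumed helper names, not the authors' code):
# standard definitions of diagnostic metrics from 2x2 counts.

def sensitivity(tp: int, fn: int) -> float:
    """Share of reference-standard positives read as positive."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Share of reference-standard negatives read as negative."""
    return tn / (tn + fp)

def ppv(tp: int, fp: int) -> float:
    """Share of positive readings that are true positives."""
    return tp / (tp + fp)

def npv(tn: int, fn: int) -> float:
    """Share of negative readings that are true negatives."""
    return tn / (tn + fn)

# Counts stated in the abstract: 68 cancers at 2 years,
# 6 read as negative by AI and 7 by human reading.
cancers = 68
ai_fn, human_fn = 6, 7

ai_sens = sensitivity(tp=cancers - ai_fn, fn=ai_fn)
human_sens = sensitivity(tp=cancers - human_fn, fn=human_fn)

print(f"AI sensitivity:    {100 * ai_sens:.1f} %")
print(f"Human sensitivity: {100 * human_sens:.1f} %")
```

With these counts the sketch gives 91.2 % for AI and 89.7 % for human reading, matching the sensitivities reported above.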