Vision Transformer for femur fracture classification

Tanzi, L.; Audisio, A.; Cirrincione, G.; Aprato, A.; Vezzetti, E.

doi:10.1016/j.injury.2022.04.013

Introduction: In recent years, the scientific community has focused on developing Computer-Aided Diagnosis (CAD) tools, primarily based on Convolutional Neural Networks (CNNs), to improve clinicians’ diagnosis of bone fractures. However, accuracy in discriminating fracture subtypes has remained far from optimal. The aim of this study was (1) to evaluate a new CAD system based on the Vision Transformer (ViT), a recent and powerful deep learning technique, and (2) to assess whether clinicians’ diagnostic accuracy could be improved using this system.

Materials and methods: A total of 4207 manually annotated images were used, divided into fracture types according to the AO/OTA classification. The ViT architecture was compared with a classic CNN and with a multistage architecture composed of successive CNNs. To demonstrate the reliability of this approach, (1) attention maps were used to visualize the most relevant areas of the images, (2) the performance of a generic CNN and the ViT was compared through unsupervised learning techniques, and (3) 11 clinicians were asked to classify 150 images of proximal femur fractures with and without the help of the ViT, and the results were compared to assess potential improvement.

Results: The ViT correctly predicted 83% of the test images. Precision, recall and F1-score were 0.77 (CI 0.64–0.90), 0.76 (CI 0.62–0.91) and 0.77 (CI 0.64–0.89), respectively. The clinicians’ diagnostic improvement was 29% (accuracy 97%; p 0.003) when supported by the ViT’s predictions, outperforming the algorithm alone.

Conclusions: This paper showed the potential of Vision Transformers in bone fracture classification. For the first time, good results were obtained in fracture subtype classification, outperforming the state of the art. The assisted diagnosis yielded the best results, demonstrating the effectiveness of collaborative work between neural networks and clinicians.
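For readers less familiar with the metrics reported in the Results, the following is a minimal sketch of how accuracy, precision, recall and F1-score can be computed for a multi-class classifier. It assumes macro averaging (an unweighted mean over classes); the abstract does not state which averaging scheme or confidence-interval method the authors used, so this should not be read as their procedure. The function names, class labels and toy predictions are hypothetical and are not data from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1).

    Assumption: each class contributes equally to the average,
    regardless of how many samples it has.
    """
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correctly predicted class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed

    precisions, recalls, f1s = [], [], []
    for c in classes:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        prec = tp[c] / predicted if predicted else 0.0
        rec = tp[c] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy labels, purely illustrative (not study data).
y_true = ["type_a", "type_a", "type_a", "type_b", "type_b", "type_c"]
y_pred = ["type_a", "type_a", "type_b", "type_b", "type_b", "type_c"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Macro averaging is one common choice when classes are imbalanced, as fracture subtypes often are, because it prevents frequent classes from dominating the score; a weighted or micro average would give different numbers on the same predictions.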