TinyML model compression: A comparative study of pruning and quantization on selected standard and custom neural networks
Shabir M. Y.; Torta G.; Damiani F.
2025-01-01
Abstract
In Machine Learning (ML), the deployment of complex Neural Network (NN) models on memory-constrained Internet of Things (IoT) devices presents a significant challenge. Tiny Machine Learning (TinyML) focuses on optimizing NN models for such environments, where computational and storage resources are limited. A major aspect of this optimization involves reducing model size without substantially compromising accuracy. We conducted a systematic literature review to identify pruning and quantization techniques suitable for optimizing NN models. In addition, this study investigates the efficiency of pruning and 8-bit integer (INT8) quantization in optimizing NN models for deployment on memory-constrained devices. The study evaluates widely used NN architectures such as ResNet50/101, VGG16, and MobileNet, alongside a custom-designed model, using the CIFAR-100, CIFAR-10, MNIST, and Fashion-MNIST datasets. The results show that combining pruning with INT8 quantization reduced the size of MobileNet by 77.01% and of the custom model by 94.38%. Notably, the custom model achieved improved accuracy, while MobileNet retained competitive accuracy with minimal loss on CIFAR-100. The main contribution of this work lies in systematically analyzing and comparing pruning, INT8 quantization, and hybrid optimization methods across multiple architectures and datasets, with performance evaluated in terms of recall, latency, and memory requirements before and after optimization. Pruning and INT8 quantization reduced model size and inference time while preserving accuracy. These findings highlight practical approaches for enabling efficient TinyML deployment in real-world IoT applications.
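The two compression techniques the abstract combines, magnitude pruning followed by INT8 quantization, can be illustrated with a minimal NumPy sketch on a single weight tensor. This is a generic illustration, not the authors' implementation: the layer shape, the 50% sparsity target, and the symmetric per-tensor quantization scheme are assumptions made for the example.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)  # a hypothetical dense layer

pruned = magnitude_prune(w, sparsity=0.5)
q, scale = quantize_int8(pruned)
dequant = q.astype(np.float32) * scale  # reconstruction used at inference time

sparsity = (q == 0).mean()
size_fp32 = w.nbytes  # 4 bytes per FP32 weight
size_int8 = q.nbytes  # 1 byte per INT8 weight
print(f"sparsity: {sparsity:.2f}, "
      f"size: {size_fp32} -> {size_int8} bytes "
      f"({100 * (1 - size_int8 / size_fp32):.0f}% smaller before sparse encoding)")
```

Quantization alone gives a 4x (75%) reduction here; the larger figures reported in the abstract (77.01% and 94.38%) come from applying both techniques to full models, where pruned weights can additionally be stored in sparse or compressed form. In practice, frameworks such as the TensorFlow Model Optimization Toolkit (magnitude pruning) and the TensorFlow Lite converter (post-training INT8 quantization) implement these steps for whole networks.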
| File | Size | Format |
|---|---|---|
| Shabir-et-al-SN-TelSys-2025.pdf (open access; publisher's PDF; description: Article) | 1.98 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



