MEMA: Fast Inference of Multiple Deep Models

Author: Birke, R
Publication date: 2021-01-01

Abstract

The execution of deep neural network (DNN) inference jobs on edge devices has become increasingly popular. Multiple such inference models can concurrently analyse on-device data, e.g. images, to extract valuable insights. Prior art focuses on low-power accelerators, compressed neural network architectures, and specialized frameworks to reduce the execution time of single inference jobs on resource-constrained edge devices. However, little is known about how different scheduling policies can further improve the runtime performance of multi-inference jobs without additional edge resources. To enable the exploration of scheduling policies, we first develop an execution framework, EDGECAFFE, which splits DNN inference jobs into the loading and execution of each network layer. We empirically characterize the impact of loading and scheduling policies on the execution time of multi-inference jobs and show their dependency on the available memory space. We propose a novel memory-aware scheduling policy, MEMA, which opportunistically interleaves the execution of different types of DNN layers based on their estimated run-time memory demands. Our evaluation on exhaustive combinations of five networks, data inputs, and memory configurations shows that MEMA alleviates the degradation of multi-inference execution times (by up to 5x) under severely constrained memory compared to standard scheduling policies, without affecting accuracy.
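
To illustrate the idea behind MEMA described above, the following minimal Python sketch interleaves layers from multiple inference jobs based on their estimated run-time memory demands. The layer names, memory estimates, and the simple greedy policy are illustrative assumptions for exposition only; they do not reproduce the actual EDGECAFFE/MEMA implementation.

```python
# Minimal, hypothetical sketch of memory-aware layer interleaving in the spirit
# of MEMA. Layer names, memory estimates, and the greedy policy are illustrative
# assumptions, not the paper's EDGECAFFE implementation.
from collections import deque


class Layer:
    def __init__(self, name, est_mem_mb):
        self.name = name              # e.g. "conv1", "fc6"
        self.est_mem_mb = est_mem_mb  # estimated run-time memory demand (MB)


def memory_aware_schedule(jobs, mem_budget_mb):
    """Interleave layers of several inference jobs under a memory budget.

    `jobs` maps a job id to a deque of Layer objects in execution order.
    Each round dispatches the next layer of every job whose estimated demand
    fits the budget; if nothing fits, the smallest pending layer is dispatched
    so that progress is always made.
    """
    schedule = []
    while any(jobs.values()):
        progressed = False
        for job_id, layers in jobs.items():
            if layers and layers[0].est_mem_mb <= mem_budget_mb:
                schedule.append((job_id, layers.popleft().name))
                progressed = True
        if not progressed:
            job_id, layers = min(((j, q) for j, q in jobs.items() if q),
                                 key=lambda item: item[1][0].est_mem_mb)
            schedule.append((job_id, layers.popleft().name))
    return schedule


# Example: the light convolutional layers of two networks interleave, while the
# heavy fully connected layer exceeding the 300 MB budget is deferred to the end.
jobs = {
    "alexnet":  deque([Layer("conv1", 40), Layer("fc6", 350)]),
    "resnet18": deque([Layer("conv1", 30), Layer("fc", 20)]),
}
print(memory_aware_schedule(jobs, mem_budget_mb=300))
```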
Year: 2021
Conference: IEEE Annual Conference on Pervasive Computing and Communications Workshops (PerCom)
Location: Kassel, Germany
Conference dates: 22-26 March 2021
Published in: 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)
Publisher: IEEE
Pages: 281-286
ISBN: 978-1-6654-0424-2
Keywords: Edge computing; Scheduling; Constrained memory; Memory aware; Multi-inference; Deep neural networks
Authors: Galjaard, J; Cox, B; Ghiassi, A; Chen, LY; Birke, R

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1891101
Citations
  • Scopus: 1
  • Web of Science (ISI): 2