As deep learning techniques become more and more popular, there is the need to move these applications from the data scientist’s Jupyter notebook to efficient and reliable enterprise solutions. Moreover, distributed training of deep learning models will happen more and more outside the well-known borders of cloud and HPC infrastructure and will move to edge and mobile platforms. Current techniques for distributed deep learning have drawbacks in both these scenarios, limiting their long-term applicability. After a critical review of the established techniques for Data Parallel training from both a distributed computing and deep learning perspective, a novel approach based on nearest-neighbour communications is presented in order to overcome some of the issues related to mainstream approaches, such as global communication patterns. Moreover, in order to validate the proposed strategy, the Flexible Asynchronous Scalable Training (FAST) framework is introduced, which allows to apply the nearest-neighbours communications approach to a deep learning framework of choice. Finally, a relevant use-case is deployed on a medium-scale infrastructure to demonstrate both the framework and the methodology presented. Training convergence and scalability results are presented and discussed in comparison to a baseline defined by using state-of-the-art distributed training tools provided by a well-known deep learning framework.

Deep Learning at Scale with Nearest Neighbours Communications

Paolo Viviani
2019-01-01

Abstract

As deep learning techniques become more and more popular, there is the need to move these applications from the data scientist’s Jupyter notebook to efficient and reliable enterprise solutions. Moreover, distributed training of deep learning models will happen more and more outside the well-known borders of cloud and HPC infrastructure and will move to edge and mobile platforms. Current techniques for distributed deep learning have drawbacks in both these scenarios, limiting their long-term applicability. After a critical review of the established techniques for Data Parallel training from both a distributed computing and deep learning perspective, a novel approach based on nearest-neighbour communications is presented in order to overcome some of the issues related to mainstream approaches, such as global communication patterns. Moreover, in order to validate the proposed strategy, the Flexible Asynchronous Scalable Training (FAST) framework is introduced, which allows to apply the nearest-neighbours communications approach to a deep learning framework of choice. Finally, a relevant use-case is deployed on a medium-scale infrastructure to demonstrate both the framework and the methodology presented. Training convergence and scalability results are presented and discussed in comparison to a baseline defined by using state-of-the-art distributed training tools provided by a well-known deep learning framework.
https://zenodo.org/record/3516093/files/20190910_final_pdf.pdf
deep learning, distributed training, stencil
Paolo Viviani
File in questo prodotto:
File Dimensione Formato  
20190910_final_pdf.pdf

Accesso aperto

Descrizione: tesi
Tipo di file: PDF EDITORIALE
Dimensione 2.29 MB
Formato Adobe PDF
2.29 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2318/1714931
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact