CINECA IRIS Institutional Research Information System

Cross-Facility Federated Learning

Iacopo Colonnelli (first author); Robert Birke; Giulio Malenza; Gianluca Mittone; Alberto Mulone; Jeroen Galjaard; Lydia Y. Chen; Sanzio Bassini; Gabriella Scipione; Jan Martinovič; Vit Vondrák; Marco Aldinucci (last author)

2024-01-01

Abstract

In a decade, AI frontier research transitioned from the researcher's workstation to thousands of high-end hardware-accelerated compute nodes. This rapid evolution shows no signs of slowing down in the foreseeable future. While top cloud providers may be able to keep pace with this growth rate, obtaining and efficiently exploiting computing resources at that scale is a daunting challenge for universities and SMEs. This work introduces the Cross-Facility Federated Learning (XFFL) framework to bridge this compute divide, extending the opportunity to efficiently exploit multiple independent data centres for extreme-scale deep learning tasks to data scientists and domain experts. XFFL relies on hybrid workflow abstractions to decouple tasks from environment-specific technicalities, reducing complexity and enhancing reusability. In addition, Federated Learning (FL) algorithms eliminate the need to move large amounts of data between different facilities, reducing time-to-solution and preserving data privacy. The XFFL approach is empirically evaluated by training a full LLaMAv2 7B instance on two facilities of the EuroHPC JU, showing how the increased computing power completely compensates for the additional overhead introduced by two data centres.
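
As a rough illustration of why FL avoids moving datasets between facilities, the sketch below shows a FedAvg-style aggregation step in plain Python: each facility trains locally and only model parameters are exchanged and averaged. This is a minimal sketch under stated assumptions, not the XFFL implementation. The abstract does not say which aggregation rule XFFL uses, and the function name `federated_average`, the dictionary layout of the weights, and the sample counts are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Average per-facility parameters, weighted by local sample count.

    Assumption: a FedAvg-style rule, used here only for illustration.
    Only parameters cross facility boundaries; training data stays local.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per set of weights")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged


# Two hypothetical facilities holding 1 and 3 local samples respectively.
facility_a = {"layer": [1.0, 2.0]}
facility_b = {"layer": [3.0, 4.0]}
global_weights = federated_average([facility_a, facility_b], [1, 3])
```

In this toy setting the facility with more samples pulls the aggregated parameters towards its own values, while neither facility ever sees the other's data.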

Year: 2024
Event title: First EuroHPC User Day
Event location: Brussels, Belgium
Event date: 11/09/2023
Volume: 240
Pages: 3-12
Journal title: Procedia Computer Science
DOI: https://dx.doi.org/10.1016/j.procs.2024.07.003
Product URL (open-access archives, full text on publisher site, etc.): https://www.sciencedirect.com/science/article/pii/S1877050924016909
Keywords: Federated Learning, High-Performance Computing, Cross-Facility Computing, Hybrid Workflows, StreamFlow
All authors: Iacopo Colonnelli, Robert Birke, Giulio Malenza, Gianluca Mittone, Alberto Mulone, Jeroen Galjaard, Lydia Y. Chen, Sanzio Bassini, Gabriella Scipione, ...
Appears in types: 04B-Conference paper in journal

Files in this item:

File	Size	Format
1-s2.0-S1877050924016909-main.pdf (open access; description: publisher's PDF; file type: publisher's PDF)	482.62 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2003510
