One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC

Misale, Claudia;
2022-01-01

Abstract

As High Performance Computing (HPC) workflows increase in complexity, their designers seek the automation and flexibility offered by cloud technologies. Container orchestration through Kubernetes enables highly desirable capabilities but does not satisfy the performance demands of HPC. Kubernetes tools that automate the lifecycle of Message Passing Interface (MPI)-based applications do not scale, and the Kubernetes scheduler does not provide crucial scheduling capabilities. In this work, we detail our efforts to port CORAL-2 benchmark codes to Kubernetes on IBM Cloud and AWS EKS. We describe contributions to the MPI Operator to achieve 3,000-rank scale, a two-orders-of-magnitude improvement over the state of the art. We discuss enhancements to Fluence, our scheduler plugin for Kubernetes based on the next-generation, cloud-ready Flux framework. Finally, we compare the placement decisions of Fluence with those of the Kubernetes scheduler and demonstrate that Fluence allows simulated scientific workflows to achieve up to 3x higher performance.
2022
4th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)
Dallas
14 November 2022
Proceedings of CANOPIE-HPC 2022
IEEE COMPUTER SOC
Pages 57-70
Cloud computing; Processor scheduling; Scalability; Message passing; High performance computing; Conferences; Benchmark testing; converged computing; cloud native HPC; Kube scheduler; scheduler placement; Fluence; MPI Operator; CORAL 2 benchmarks; MPI; Kubernetes; HPC
Milroy, Daniel J.; Misale, Claudia; Georgakoudis, Giorgis; Elengikal, Tonia; Sarkar, Abhik; Drocco, Maurizio; Patki, Tapasya; Yeom, Jae-Seung; Gutierr...
Files in this item:
File: One_Step_Closer_to_Converged_Computing_Achieving_Scalability_with_Cloud-Native_HPC.pdf

Restricted access

File type: PUBLISHER'S PDF
Size: 939.49 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2030517
Citations
  • PMC: ND
  • Scopus: 5
  • ISI: 2