Supercomputing Infrastructure as Code to Enable Exascale Workflows

Wednesday, June 30, 2021 12:40 PM to 1:00 PM · 20 min. (Africa/Abidjan)
Stream#2
HPC Workflows

Contributors:
Abstract:

Supercomputing infrastructure as code (IaC) offers the potential for secure and flexible mapping of diverse scientific workflows by making high-performance computing, networking and storage systems programmable. IaC is commonplace in cloud computing; however, it has seen little adoption within multi-petascale ecosystems because of concerns about performance and interoperability with classic HPC data centre ecosystems. This talk gives an overview of traditional supercomputing tools for managing workflows, and specifically their constraints in accommodating the computing, networking and storage requirements of machine learning and experimental data-driven workflows. Two areas are compared for supporting an operational weather forecasting platform: (1) multiple tenants for isolation and security; and (2) DevOps processes for continuous integration and deployment (CI/CD). In 2015, MeteoSwiss was the first operational weather forecasting centre to introduce a GPU-based platform, composed of a self-contained operational system and a failover system. The performance goals and budget for the new model ruled out that level of hardware redundancy. Instead, a software-defined networking approach is introduced to flexibly manage multiple cluster instances, or tenants, including an operational cluster, a research and development cluster, and a test and development cluster. Because of the strict validity requirements of the MeteoSwiss operational suite, the DevOps cycles are challenging: the operating environment, the GPU stack, and the MeteoSwiss operational software stack depend on one another. Classic and agile toolchains are adapted to support CI/CD pipelines for the MeteoSwiss operational suite. Results demonstrate that the cost-to-performance, availability and reliability metrics of the MeteoSwiss platform improve by over 50% compared with its predecessor infrastructure.
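
As a rough illustration of the multi-tenant idea described in the abstract, the sketch below declares the three cluster instances as plain data and checks that they stay isolated. It is not the tooling used for the MeteoSwiss platform: the `Tenant` schema, the VLAN IDs, the node names, and the `validate_isolation` and `move_node` helpers are all hypothetical, chosen only to show how a tenant layout can be expressed and verified in code.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tenant:
    """One cluster instance carved out of the shared machine (hypothetical schema)."""
    name: str
    vlan_id: int            # assumed: one isolated network segment per tenant
    nodes: frozenset[str]   # compute nodes currently assigned to this tenant


def validate_isolation(tenants: list[Tenant]) -> list[str]:
    """Return isolation violations; an empty list means no VLAN or node is shared."""
    errors: list[str] = []
    vlan_owner: dict[int, str] = {}
    node_owner: dict[str, str] = {}
    for t in tenants:
        if t.vlan_id in vlan_owner:
            errors.append(f"VLAN {t.vlan_id} shared by {vlan_owner[t.vlan_id]} and {t.name}")
        else:
            vlan_owner[t.vlan_id] = t.name
        for node in sorted(t.nodes):
            if node in node_owner:
                errors.append(f"node {node} shared by {node_owner[node]} and {t.name}")
            else:
                node_owner[node] = t.name
    return errors


def move_node(tenants: list[Tenant], node: str, target: str) -> list[Tenant]:
    """Return a new layout in which `node` belongs only to the tenant named `target`."""
    if target not in {t.name for t in tenants}:
        raise ValueError(f"unknown tenant: {target}")
    return [
        replace(t, nodes=(t.nodes | {node}) if t.name == target else (t.nodes - {node}))
        for t in tenants
    ]


# Hypothetical layout mirroring the three clusters named in the abstract.
layout = [
    Tenant("operational", 100, frozenset({"nid0001", "nid0002"})),
    Tenant("research", 200, frozenset({"nid0003"})),
    Tenant("test", 300, frozenset({"nid0004"})),
]
```

Because the layout is ordinary data, a change such as `move_node(layout, "nid0003", "operational")` can be reviewed, validated with `validate_isolation`, and rolled back like any other code change, which is the property IaC aims to bring to the platform.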