Challenges of a HPC cloud journey

Tuesday, May 31, 2022 2:15 PM to 2:35 PM · 20 min. (Europe/Berlin)
Hall 4 - Ground Floor
HPC Workflows

Information

PGS is moving from a CapEx (investment) to an OpEx (running-cost) model for the majority of its HPC environment by shifting significant compute capability from our own datacenters to the Google Cloud Platform (GCP). Our first approach was a lift-and-shift proof of concept (POC) with the entire system stack running on GCP. It was abandoned: moving both compute and storage to the cloud proved uneconomical and gave a poor user experience.

The current effort uses a hybrid model that integrates on-premises infrastructure with capabilities in the cloud, keeping storage on-prem and focusing on compute in the cloud. This POC was delivered in early 2020. As of May 22nd, roughly 70% of our current compute needs (325,000 cores) have been shifted to the cloud by optimizing 8 of our most compute-intensive algorithms for cloud runs. The focus during the rewrite has been to stay as platform agnostic as we possibly can, so that we are able to run on multiple platforms without further work. We have chosen Kubernetes to support this approach.

Our commitment to the strategy is reflected in the decommissioning of 2 datacenters as well as our now-aging Cray supercomputers, some 250,000 cores. The journey has not been entirely without bumps, and I will share some of the experience we gained along the way.
Format
On-site / Live-Online
