Accelerating GAN Training Using Cloud Native Technologies and the Public Cloud

Tuesday, June 29, 2021 1:15 PM to 1:30 PM · 15 min. (Africa/Abidjan)

Information

With the increasing number of Machine Learning and Deep Learning applications in High Energy Physics, easy access to dedicated infrastructure has become a requirement for fast and efficient R&D. This work explores the use of cloud native technologies to ease and improve the training of a Generative Adversarial Network (GAN). We present our experience taking user workloads and making the modifications required to run them at scale on this type of infrastructure. More specifically, we rely on Kubernetes, Kubeflow and other projects in the CNCF ecosystem to parallelize the training process across multiple GPUs and Google Tensor Processing Units (TPUs), and we compare two algorithms: the TensorFlow built-in distribution logic and a custom training loop, optimised for finer control over the work assigned to each GPU worker or TPU core. The quality of the generated data is compared against Monte Carlo simulation. A close-to-linear speed-up of the training process is obtained, while most of the performance in terms of physics results is retained. Additionally, we present our experience benchmarking these approaches at scale, over multiple GPU nodes, deploying the training process on different public cloud providers in search of overall efficiency and cost-effectiveness. The combination of data science, cloud native deployments and the associated economics makes it possible to burst out heterogeneously, exploiting the full potential of cloud-based services.
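The first of the two parallelization schemes compared above, TensorFlow's built-in distribution logic, amounts to building and compiling a Keras model inside a tf.distribute strategy scope and letting model.fit() shard each global batch across replicas. The following is a minimal sketch of those mechanics, with a toy MNIST classifier standing in for the actual GAN; the architecture, batch size and optimiser settings are illustrative assumptions, not the configuration used in the work.

```python
import tensorflow as tf

# Built-in data parallelism: a Keras model compiled inside a strategy
# scope is replicated on every device, and model.fit() automatically
# shards each global batch across the replicas.
strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
# For TPUs, one would instead resolve and initialise the TPU system:
#   resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
#   tf.config.experimental_connect_to_cluster(resolver)
#   tf.tpu.experimental.initialize_tpu_system(resolver)
#   strategy = tf.distribute.TPUStrategy(resolver)
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(784,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Scale the batch size with the number of replicas so each device keeps
# a constant per-replica batch (64 here, an arbitrary choice).
global_batch = 64 * strategy.num_replicas_in_sync
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, batch_size=global_batch, epochs=1)
```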
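The second scheme, the custom training loop, replaces model.fit() with explicit calls to strategy.run(), giving the programmer direct control over what each GPU worker or TPU core executes at every step. Below is a self-contained sketch of such a loop for a toy GAN; the MLP architectures, latent dimension, learning rates and random stand-in dataset are assumptions made for illustration only.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
LATENT_DIM = 128                                    # assumed latent size
GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync   # 64 per replica (assumed)

with strategy.scope():
    # Toy MLP generator/discriminator standing in for the real networks.
    generator = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(LATENT_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(784, activation="tanh"),
    ])
    discriminator = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(784,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    g_opt = tf.keras.optimizers.Adam(1e-4)
    d_opt = tf.keras.optimizers.Adam(1e-4)
    # reduction="none": losses are averaged manually over the *global*
    # batch so the cross-replica gradient all-reduce stays correct.
    bce = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction="none")

def train_step(real):
    noise = tf.random.normal([tf.shape(real)[0], LATENT_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(noise, training=True)
        real_logits = discriminator(real, training=True)
        fake_logits = discriminator(fake, training=True)
        d_loss = tf.nn.compute_average_loss(
            bce(tf.ones_like(real_logits), real_logits)
            + bce(tf.zeros_like(fake_logits), fake_logits),
            global_batch_size=GLOBAL_BATCH)
        g_loss = tf.nn.compute_average_loss(
            bce(tf.ones_like(fake_logits), fake_logits),
            global_batch_size=GLOBAL_BATCH)
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    return d_loss, g_loss

@tf.function
def distributed_step(batch):
    # strategy.run executes train_step once per replica; each replica
    # receives its own shard of the distributed batch.
    per_replica = strategy.run(train_step, args=(batch,))
    return [strategy.reduce(tf.distribute.ReduceOp.SUM, r, axis=None)
            for r in per_replica]

# Random tensors stand in for the real training sample.
dataset = tf.data.Dataset.from_tensor_slices(
    tf.random.normal([1024, 784])).batch(GLOBAL_BATCH)
for batch in strategy.experimental_distribute_dataset(dataset):
    d_loss, g_loss = distributed_step(batch)
```

Scaling both losses by the global batch size, rather than the per-replica batch, is what makes the summed all-reduced gradients equivalent to single-device training at the same effective batch size.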
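For the multi-node GPU runs, Kubeflow's training operator launches one pod per worker and injects a TF_CONFIG environment variable describing the cluster, which tf.distribute.MultiWorkerMirroredStrategy reads to discover its peers. The sketch below shows the shape of that variable for an assumed two-worker job (hypothetical pod names and port); in a real deployment the operator sets it, and user code only needs to instantiate the strategy.

```python
import tensorflow as tf

# Under a Kubeflow TFJob, each replica receives a TF_CONFIG environment
# variable; for worker 0 of an assumed two-worker job it would resemble:
#
#   {"cluster": {"worker": ["gan-train-worker-0:2222",
#                           "gan-train-worker-1:2222"]},
#    "task": {"type": "worker", "index": 0}}
#
# MultiWorkerMirroredStrategy parses it and sets up collective ops
# across all GPUs on all nodes; run locally without TF_CONFIG, it
# degrades gracefully to a single worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# From here on, model.fit() or a custom strategy.run() loop proceeds
# exactly as in the single-node sketches above.
```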
