Distributed Training of Deep Neural Networks

Sunday, May 12, 2024 2:00 PM to 6:00 PM · 4 hr. (Europe/Berlin)
Hall Y1 - 2nd floor
Tutorial
AI Applications powered by HPC Technologies · ML Systems and Tools

Information

Deep learning (DL) is rapidly becoming pervasive in almost all areas of computer science and is even being used to assist computational science modeling and simulations. The now widely accepted (but historically surprising) key behavior of these systems is that they reliably scale, i.e., they continuously improve in performance as the number of model parameters and the amount of data grow. As the demand for larger, more sophisticated, and more accurate AI models increases, the need for large-scale parallel model training has become increasingly pressing. Accordingly, in the past few years, several parallel algorithms and frameworks have been developed to parallelize model training on GPU-based platforms. This tutorial will introduce the basics of the state of the art in distributed deep learning. We will use a toy neural network as a running example, gradually scaling its number of parameters (from tens of millions to billions) and simultaneously introducing the techniques used to train at the corresponding scale. We will cover algorithms and frameworks falling under the purview of data parallelism (PyTorch DDP and DeepSpeed), tensor parallelism (Megatron-LM), and pipeline parallelism (AxoNN).
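
As a taste of the data-parallel portion of the tutorial, the sketch below shows how a toy multilayer perceptron could be trained with PyTorch DDP across several GPUs. This is an illustrative sketch only, not the tutorial's actual material; the layer sizes, hyperparameters, synthetic data, and the torchrun launch command are placeholder assumptions.

# Minimal sketch: data-parallel training of a toy MLP with PyTorch DDP.
# Launch with, e.g.:  torchrun --nproc_per_node=4 ddp_toy.py
# Model size, batch size, and data are placeholders for illustration.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Toy network; in the tutorial's running example this would be scaled
    # from tens of millions to billions of parameters.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 10),
    ).to(device)

    # DDP keeps a full model replica on every GPU and all-reduces
    # gradients during backward() to keep the replicas in sync.
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Each rank draws its own shard of (here: random) data.
        x = torch.randn(32, 1024, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()   # gradients are averaged across all ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Tensor-parallel (Megatron-LM style) and pipeline-parallel approaches covered later in the tutorial instead split the model itself across GPUs once replicating it per device is no longer feasible.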
Format
On-site
Targeted Audience
This tutorial targets a broad audience, from beginners to advanced users of deep learning, who want to train their models on tens to thousands of GPUs.
Beginner Level: 60%
Intermediate Level: 40%
Prerequisites
Attendees should have basic familiarity with Python and with running Python programs. Familiarity with running parallel programs on HPC clusters is a plus.
