Distributed Training of Deep Neural Networks

Sunday, May 21, 2023 2:00 PM to 6:00 PM · 4 hr. (Europe/Berlin)
Hall Y10 - 2nd Floor
Tutorial
AI Applications, ML Systems and Tools

Information

Deep learning (DL) is rapidly becoming pervasive in almost all areas of computer science and is even being used to assist computational science modeling and simulations. The now widely accepted (but historically surprising) key behavior of these systems is that they scale reliably, i.e. they continue to improve in performance as the number of model parameters and the amount of data grow. As the demand for larger, more sophisticated, and more accurate AI models increases, the need for large-scale parallel model training has become increasingly pressing. Consequently, several parallel algorithms and frameworks have been developed in the past few years to parallelize model training on GPU-based platforms. This tutorial introduces the basics of the state of the art in distributed deep learning. We will use a "toy" neural network as a running example, gradually scaling its number of parameters (from tens of millions to billions) while introducing the techniques used to train at each scale. We will cover algorithms and frameworks falling under the purview of data parallelism (PyTorch DDP and DeepSpeed), tensor parallelism (Megatron-LM), and pipeline parallelism (AxoNN).
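To give a flavor of the first of these techniques, below is a minimal sketch (not taken from the tutorial materials) of wrapping a small "toy" model in PyTorch DistributedDataParallel. The script name, model sizes, and hyperparameters are illustrative assumptions; it presumes a launch such as `torchrun --nproc_per_node=<num_gpus> train_ddp.py`, where each process drives one GPU and gradients are averaged across all of them.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # A deliberately small "toy" model; in the tutorial the running example
    # grows from tens of millions to billions of parameters.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)
    ).to(device)
    # Each rank holds a full replica; DDP all-reduces gradients across GPUs.
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Placeholder random batch; a real run would use a DataLoader with a
        # DistributedSampler so each rank sees a different shard of the data.
        x = torch.randn(32, 1024, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradient all-reduce overlaps with the backward pass
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Tensor parallelism (Megatron-LM) and pipeline parallelism (AxoNN) instead split the model itself across GPUs, which the tutorial introduces once the replicated model no longer fits in a single GPU's memory.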
Format
On-site
Targeted Audience
This tutorial targets a broad audience, from beginner to advanced users of deep learning who want to train their models on tens to thousands of GPUs.
Prerequisites
Attendees should bring their own laptop.
Beginner Level: 50%
Intermediate Level: 35%
Advanced Level: 15%