Distributed Training of Deep Neural Networks
Sunday, May 12, 2024 2:00 PM to 6:00 PM · 4 hr. (Europe/Berlin)
Hall Y1 - 2nd floor
Tutorial
AI Applications powered by HPC Technologies, ML Systems and Tools
Information
Deep learning (DL) is rapidly becoming pervasive in almost all areas of
computer science and is even being used to assist computational science
modeling and simulations. The now accepted (but historically surprising)
key behavior of these systems is that they scale reliably, i.e., their
performance continuously improves as the number of model parameters and
the amount of data grow. As the demand for larger, more sophisticated, and
more accurate AI models increases, the need for large-scale parallel model
training has become increasingly pressing. Consequently, several parallel
algorithms and frameworks have been developed in the past few years to
parallelize model training on GPU-based platforms. This tutorial will
introduce the basics of the state of the art in distributed deep learning.
We will use a toy neural network as a running example, gradually scaling
its number of parameters (from tens of millions to billions) while
introducing the techniques used to train at each corresponding scale. We
will cover algorithms and frameworks falling under the purview of data
parallelism (PyTorch DDP and DeepSpeed), tensor parallelism (Megatron-LM),
and pipeline parallelism (AxoNN).
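
To give a flavor of the data-parallel approach covered in the tutorial, below is a minimal sketch of training a small model with PyTorch DDP. The toy MLP, hyperparameters, and random data are illustrative assumptions, not the tutorial's actual running example; the script assumes a launch via torchrun on one or more GPUs.

# Minimal data-parallel training sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_toy.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A small illustrative MLP; the tutorial's running example will differ.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # replicate model, sync gradients

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        # Each rank processes its own shard of the global batch (random data here).
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # DDP all-reduces gradients across ranks during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Tensor- and pipeline-parallel training with Megatron-LM and AxoNN require additional model partitioning and are introduced later in the tutorial.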
Contributors:
Format
On-site
Targeted Audience
This tutorial targets a broad audience, from beginner to advanced users of deep
learning who want to train their models on tens to thousands of GPUs.
Beginner Level: 60%
Intermediate Level: 40%
Prerequisites
Attendees should have basic familiarity with Python and with running Python
programs. Familiarity with running parallel programs on HPC clusters is a plus.