

Hy-Fi: Hybrid Five Dimensional Parallel DNN Training on High-Performance GPU Clusters
Tuesday, May 31, 2022 9:00 AM to 9:20 AM · 20 min. (Europe/Berlin)
Hall G1 - 2nd Floor
Information
Recent advances in High Performance Computing (HPC) hardware are enabling complex Deep Learning (DL) models to achieve state-of-the-art performance by exploiting multiple processing elements, such as GPUs, concurrently. Data parallelism is a widely adopted parallelization strategy that achieves impressive performance, but it maintains a copy of the entire DL model on every processing element, which is not feasible for models like GPT-3 and AmoebaNet on modern NVIDIA GPUs. Layer parallelism, or inter-layer model parallelism, avoids this overhead by splitting the model into partitions of one or more layers that can be executed concurrently. This approach, however, runs into limitations when training DNN models like AmoebaNet on high-resolution images.
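As a rough illustration of the distinction (not Hy-Fi's implementation), the PyTorch sketch below contrasts a data-parallel replica, where every GPU must hold the whole model, with a layer-parallel split, where each GPU holds only a partition of layers. The toy model, split point, and device names are illustrative assumptions.

```python
# Minimal PyTorch sketch contrasting the two strategies described above.
# The toy model, the split point, and the device names are illustrative
# assumptions, not Hy-Fi's actual implementation.
import torch.nn as nn

def make_model():
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 32 * 32, 10),
    )

# Data parallelism: every GPU keeps a full replica of the model and
# processes its own shard of the batch; this breaks down once the model
# no longer fits in a single GPU's memory.
data_parallel_model = nn.DataParallel(make_model().cuda())

# Layer (inter-layer) model parallelism: the layers are partitioned
# across GPUs, so no single device has to hold the entire model.
layers = list(make_model().children())
part0 = nn.Sequential(*layers[:4]).to("cuda:0")   # conv block on GPU 0
part1 = nn.Sequential(*layers[4:]).to("cuda:1")   # classifier on GPU 1

def layer_parallel_forward(x):
    # Activations move between partitions instead of replicating the model.
    x = part0(x.to("cuda:0"))
    return part1(x.to("cuda:1"))
```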
We propose Hy-Fi: Hybrid Five-Dimensional Parallelism, a system that takes advantage of five parallelism dimensions---data, model, spatial, pipeline, and bi-directional parallelism---and enables efficient distributed training of out-of-core models and layers. Hy-Fi also introduces communication-level optimizations to integrate these dimensions.
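One plausible way to integrate several parallelism dimensions at the communication level is to arrange the ranks in a multi-dimensional process grid and create a sub-communicator per dimension. The sketch below illustrates that idea with mpi4py; the 5-D grid shape and the splitting scheme are assumptions made for illustration, not the paper's actual design.

```python
# Hypothetical sketch: carving a flat set of MPI ranks into a 5-D process
# grid (data x model x spatial x pipeline x bi-directional) and building
# one sub-communicator per dimension. The grid shape, mpi4py usage, and
# splitting scheme are illustrative assumptions, not Hy-Fi's design.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
grid_shape = (4, 2, 2, 2, 2)            # product must equal comm.size (64 here)
assert int(np.prod(grid_shape)) == comm.size

# Coordinates of this rank inside the 5-D grid.
coords = list(np.unravel_index(comm.rank, grid_shape))

# Ranks that share every coordinate except the one for dimension `dim`
# communicate together along that parallelism dimension.
sub_comms = []
for dim in range(len(grid_shape)):
    fixed = coords.copy()
    fixed[dim] = 0                      # ignore this dimension when coloring
    color = int(np.ravel_multi_index(fixed, grid_shape))
    sub_comms.append(comm.Split(color=color, key=int(coords[dim])))

# sub_comms[0] is then the data-parallel group, sub_comms[3] the pipeline
# group, and so on; collectives such as gradient allreduce are issued on
# the appropriate sub-communicator.
```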
We report up to 2.6X and 1.68X speedups over layer and pipeline parallelism, respectively. We demonstrate the benefits of our proposed designs on up to 2,048 GPUs with AmoebaNet and ResNet models. Towards the end of the paper, we use Hy-Fi to enable DNN training on high-resolution images, including 8,192×8,192 and 16,384×16,384 images.
Contributors:
- Quentin Anthony (The Ohio State University)
- Arpan Jain (The Ohio State University)
- Pouya Kousha (The Ohio State University)
- Dhabaleswar K. Panda (The Ohio State University)
- Aamir Shafi (The Ohio State University)
- Hari Subramoni (The Ohio State University)
Format
On-site / Live-Online



