Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements

Tuesday, May 31, 2022 9:40 AM to 10:00 AM · 20 min. (Europe/Berlin)
Hall G1 - 2nd Floor

Information

Frameworks for Distributed Deep Learning (DDL) have become popular alternatives for distributing training by adding a few lines of code to a single-node script. From a High-Performance Computing (HPC) perspective, the profiling tools traditionally used by Machine Learning (ML) researchers fail to expose details about distributed training performance, such as synchronization points, communication and computation time, and device usage throughout the training. Moreover, these results are usually considered independently. We present a methodology for the performance analysis of DDL frameworks that combines HPC and ML tools, applying intrusive and non-intrusive tracing to enrich the findings of a strong-scaling study on three clusters with different GPU models. We selected two modern DDL frameworks: Horovod and Tarantella. Using spatial and temporal analysis, we identify bottlenecks in the frameworks, such as a long initialization time in Horovod and the non-distribution of data during the testing phase in Tarantella. We extract performance measurements using temporal aggregation over the training phases, which can help DDL framework developers improve their tools. Horovod presented the best scaling efficiency for 4 GPUs or more, reaching up to 84.6% scaling efficiency with 4 GPUs and a large batch size, while Tarantella achieves 54.7% in the same case. Using our temporal aggregation approach, we identified that this result originates from Horovod processing an epoch faster than Tarantella.
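
For context, the "few lines of code" mentioned above typically look like the sketch below for Horovod with Keras. This is a minimal, illustrative adaptation of a single-node script, not the training setup used in the study; the model, dataset, and hyperparameters are placeholders.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per GPU, launched e.g. with horovodrun

    # Pin each worker process to a single local GPU.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder model; any existing single-node Keras model works.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across workers via allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

    callbacks = [
        # Broadcast initial weights from rank 0 so all workers start identically.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    # model.fit(train_dataset, epochs=..., callbacks=callbacks)  # dataset is a placeholder

The wrapped optimizer and the initial broadcast are exactly the kind of collective operations (allreduce, broadcast) whose synchronization and communication time the combined HPC/ML tracing in this work makes visible.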
Contributors:

  • Lucas Mello Schnorr (Federal University of Rio Grande do Sul)
  • Ana Solorzano (Federal University of Rio Grande do Sul)
Format
On-site, Live-Online