Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters

Tuesday, May 31, 2022 4:35 PM to 4:55 PM · 20 min. (Europe/Berlin)
Hall G1 - 2nd Floor

Information

As High-Performance Computing (HPC) and Deep Learning (DL) applications scale out on GPUs, the communication of GPU-resident data plays a vital role in end-to-end application performance. Among the MPI operations used in these applications, All-to-All is one of the most communication-intensive. Over the last decade, much research has focused on optimizing large GPU-resident data transfers. However, even in state-of-the-art GPU-aware MPI libraries, All-to-All communication of large GPU-resident data still suffers from poor performance due to the throughput limitations of commodity networks. Recent work on point-to-point-based online compression with GPU-based compression algorithms reduces the volume of data transferred and has shown performance benefits on modern GPU clusters.

In this paper, we redesign the MPI library to enable efficient collective-level online compression with an optimized host-staging scheme for All-to-All communication. The proposed design achieves benefits at both the microbenchmark and application levels. At the microbenchmark level, it reduces All-to-All communication latency by up to 87%. For PSDNS, a traditional HPC application, it reduces All-to-All communication latency and total runtime by up to 29.2% and 21.8%, respectively, while passing data validation. For DeepSpeed, a DL optimization library, it reduces All-to-All runtime by up to 26.4% compared to a state-of-the-art MPI library with point-to-point-based compression, again while passing data validation. To the best of our knowledge, this is the first work to leverage online GPU-based compression techniques to significantly accelerate All-to-All communication for HPC and DL applications.
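The redesigned collective lives inside the MPI library and is not reproduced here, but the general idea behind a compressed All-to-All can be sketched with standard MPI calls: compress each per-destination block, exchange only the compressed byte counts and bytes, then decompress on arrival. In the minimal sketch below, gpu_compress and gpu_decompress are hypothetical stand-ins (trivial memcpy stubs here) for a real GPU compression codec, and all buffers are host-resident for simplicity; the paper's actual design stages compressed GPU data through host memory and does the compression on the GPU.

/* Sketch: compressed All-to-All built from point-to-point operations.
 * gpu_compress()/gpu_decompress() are hypothetical placeholders for a
 * GPU compression library; they copy bytes unchanged here so the
 * sketch runs as-is. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder "compressor": identity copy. A real design would launch
 * a GPU compression kernel and return the compressed size. */
static size_t gpu_compress(const void *in, size_t n, void *out) {
    memcpy(out, in, n);
    return n;
}

static void gpu_decompress(const void *in, size_t n, void *out) {
    memcpy(out, in, n);
}

/* Each rank sends one block of 'blk' bytes to every peer. Blocks are
 * compressed before the exchange; only compressed bytes cross the wire. */
void alltoall_compressed(const void *sendbuf, void *recvbuf,
                         size_t blk, MPI_Comm comm) {
    int size, rank, p;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    char   *cbuf = malloc(blk * size);          /* compressed send blocks */
    char   *rbuf = malloc(blk * size);          /* compressed recv blocks */
    size_t *csz  = malloc(size * sizeof *csz);  /* compressed send sizes  */
    size_t *rsz  = malloc(size * sizeof *rsz);  /* compressed recv sizes  */
    MPI_Request *req = malloc(2 * size * sizeof *req);

    for (p = 0; p < size; p++)
        csz[p] = gpu_compress((const char *)sendbuf + p * blk, blk,
                              cbuf + p * blk);

    /* Tell every peer how many compressed bytes to expect from us. */
    MPI_Alltoall(csz, (int)sizeof(size_t), MPI_BYTE,
                 rsz, (int)sizeof(size_t), MPI_BYTE, comm);

    for (p = 0; p < size; p++) {
        MPI_Irecv(rbuf + p * blk, (int)rsz[p], MPI_BYTE, p, 0, comm,
                  &req[p]);
        MPI_Isend(cbuf + p * blk, (int)csz[p], MPI_BYTE, p, 0, comm,
                  &req[size + p]);
    }
    MPI_Waitall(2 * size, req, MPI_STATUSES_IGNORE);

    for (p = 0; p < size; p++)
        gpu_decompress(rbuf + p * blk, rsz[p], (char *)recvbuf + p * blk);

    free(req); free(rsz); free(csz); free(rbuf); free(cbuf);
}

This is only an illustration of the point-to-point-based baseline the paper improves upon; the proposed collective-level design additionally pipelines compression with the host-staging copies inside the MPI library rather than layering it on top of the API.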
Contributors:

  • Quentin Anthony (The Ohio State University)
  • Pouya Kousha (The Ohio State University)
  • Dhabaleswar K. Panda (The Ohio State University)
  • Aamir Shafi (The Ohio State University)
  • Kawthar Shafie Khorassani (The Ohio State University)
  • Hari Subramoni (The Ohio State University)
  • Qinghua Zhou (The Ohio State University)
Format
On-site · Live-Online
