Design and Implementation of Multi-Rail-Aware Hierarchical MPI Reduce-Scatter and Allgather Operations

Thursday, June 25, 2026 1:40 PM to 2:00 PM · 20 min. (Europe/Berlin)

Hall E - 2nd Floor

Research Paper

Novel Algorithms, Runtime Systems for HPC

Information

Collective communication operations, particularly Reduce-Scatter and Allgather, have become critical performance bottlenecks for large-scale Distributed Deep Learning and High-Performance Computing applications. While modern supercomputers employ dense nodes with multi-rail network interfaces to maximize bandwidth, generic MPI implementations often utilize flat, CPU-centric algorithms that fail to exploit this complex hardware topology. In this paper, we propose a systematic optimization framework for efficient Reduce-Scatter and Allgather on multi-rail GPU systems. We introduce novel hierarchical designs that leverage customized GPU kernels to perform direct data placement, completely eliminating intermediate staging buffers and copy overhead. To further maximize throughput, we apply pipelining strategies that overlap intra-node phases with inter-node communication to effectively hide network latency. Experimental results scaling up to 512 GPUs on Delta-AI (NVIDIA GH200), along with evaluations on MareNostrum 5 (256 NVIDIA H100 GPUs) and Cosmos (128 AMD MI300A APUs), confirm that our approach surpasses state-of-the-art GPU-aware MPI and vendor-optimized libraries. Specifically, micro-benchmarks show up to 17.3x and 8.9x latency reductions for Reduce-Scatter and Allgather, respectively. At the application level, nanoGPT training with Fully Sharded Data Parallelism achieves a 2.2x speedup in iteration time at scale, validating the portability and efficiency of our designs across both NVIDIA and AMD architectures.
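
For readers unfamiliar with the hierarchical decomposition the abstract refers to, the following is a minimal sketch in plain Python, not the authors' implementation. It simulates ranks as lists and assumes a block layout in which rank r = node * gpus_per_node + local_index owns chunk r. The function names, the layout, and the two-phase split (intra-node first, then inter-node among GPUs with the same local index) are illustrative assumptions. The GPU kernels, direct data placement, and pipelining described in the paper are not modeled here.

```python
def flat_reduce_scatter(data, chunk):
    """Reference result: rank r receives the element-wise sum of chunk r over all ranks."""
    p = len(data)
    return [
        [sum(data[src][r * chunk + i] for src in range(p)) for i in range(chunk)]
        for r in range(p)
    ]


def hierarchical_reduce_scatter(data, nodes, gpus_per_node, chunk):
    """Two-phase reduce-scatter (hypothetical layout: rank = node * gpus_per_node + local)."""
    p = nodes * gpus_per_node

    # Phase 1 (intra-node): local GPU g accumulates, over the GPUs of its own node,
    # every chunk whose final owner has local index g on any node.
    partial = {}
    for n in range(nodes):
        for g in range(gpus_per_node):
            for dst_node in range(nodes):
                owner = dst_node * gpus_per_node + g
                partial[(n, g, owner)] = [
                    sum(
                        data[n * gpus_per_node + local][owner * chunk + i]
                        for local in range(gpus_per_node)
                    )
                    for i in range(chunk)
                ]

    # Phase 2 (inter-node): GPUs that share local index g reduce their partial sums
    # across nodes; each keeps only the chunk it owns.
    result = []
    for r in range(p):
        _, g = divmod(r, gpus_per_node)
        result.append(
            [sum(partial[(src, g, r)][i] for src in range(nodes)) for i in range(chunk)]
        )
    return result


if __name__ == "__main__":
    nodes, gpus_per_node, chunk = 2, 4, 3
    p = nodes * gpus_per_node
    # Integer inputs keep the comparison exact.
    data = [[r * 100 + j for j in range(p * chunk)] for r in range(p)]
    assert hierarchical_reduce_scatter(data, nodes, gpus_per_node, chunk) == flat_reduce_scatter(data, chunk)
    print("hierarchical and flat reduce-scatter agree")
```

In this sketch, each local index g forms its own inter-node group in phase 2, so on a multi-rail node those groups could in principle be mapped to different network interfaces. That mapping is an assumption of the sketch rather than a detail stated in the abstract.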

Format

on-site