NVIDIA's Quantum network congestion control technology and its impact on application performance

Wednesday, June 1, 2022 9:00 AM to 9:20 AM · 20 min. (Europe/Berlin)
Hall G1 - 2nd Floor
Exascale Systems

Information

Applications running on large-scale systems often suffer from degraded performance and non-reproducible run times due to network-level congestion, whether caused by the application's own network traffic or by unrelated background traffic (e.g., other applications). This paper describes the hardware-based congestion control algorithm implemented in NVIDIA's Quantum HDR 200 Gb/s InfiniBand generation and the AI-based training used to obtain the algorithm's parameters. The hardware leverages NVIDIA's Data Center Quantized Congestion Notification (DCQCN) algorithm and protocol and applies it to the InfiniBand network layer. Congestion patterns described in the literature are studied, intensified to create greater congestion, and used to measure the impact of such patterns on three applications: Incompact3D, LAMMPS and VASP. The study shows that network congestion can increase the measured run time of an individual application by a factor of ten or more, while enabling the implemented congestion control on the Quantum HDR InfiniBand technology recovers most of the lost time for the tested applications and congestion patterns.
Contributors:

  • Gerardo Cisneros-Stoianowski (NVIDIA)
  • Richard Graham (NVIDIA)
  • Yong Qin (NVIDIA)
  • Gilad Shainer (NVIDIA)
  • Yuval Shpigelman (NVIDIA)
  • Craig Stunkel (NVIDIA)
Format
On-site, Live-Online