Multi-Channel DMA-Accelerated MPI Intra-Node Communication: A Hybrid Adaptive Framework with Memory Copy Offloading

Thursday, June 25, 2026 1:20 PM to 1:40 PM · 20 min. (Europe/Berlin)

Hall E - 2nd Floor

Research Paper

Emerging Computing Technologies · Networking and Interconnects · Parallel Programming Languages

Information

Modern HPC systems built on multi-/many-core processor architectures face increasingly complex memory hierarchies and higher core counts. Scaling HPC applications up on these systems brings new challenges for intra-node communication workloads.
Driver and library support, such as I/OAT and the Linux DMA engine API, enables the DMA engine to offload data and memory copy workloads from the CPU in modern HPC systems. Motivated by the limited scalability and performance of state-of-the-art MPI intra-node communication designs, including shared-memory, kernel-assisted, and DMA-based offloading approaches, we propose an optimized multi-channel DMA-based offloading design with individual channel management. To orchestrate this new design alongside the existing designs, we integrate it into a new hybrid MPI intra-node communication framework and implement dynamic deployment logic that adaptively switches between kernel-assisted and DMA-based offloading data copy designs for different use cases. In our performance evaluation across two common HPC architectures, we observe up to 34% improvement in communication latency at the microbenchmark level, up to 12% faster execution in the AWP-ODC-CPU application, and up to 68% communication/computation overlap in the 3D-Stencil mini-application.
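
The adaptive switching described above can be pictured with a small sketch. This is not the authors' implementation. It is a minimal illustration that assumes the choice depends on message size and per-channel backlog. The names `CopySelector` and `DmaChannel` and both byte thresholds are hypothetical, since the abstract does not state the actual decision criteria.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff: the abstract gives no threshold, so this value is an assumption.
DMA_MIN_BYTES = 64 * 1024


@dataclass
class DmaChannel:
    """Illustrative stand-in for one DMA channel, tracked individually."""
    channel_id: int
    pending_bytes: int = 0  # bytes queued on this channel and not yet completed


@dataclass
class CopySelector:
    """Chooses a copy path per message (assumed policy, for illustration only)."""
    channels: list = field(default_factory=list)
    max_pending_bytes: int = 4 * 1024 * 1024  # hypothetical per-channel backlog cap

    def choose(self, nbytes):
        """Return ("kernel", None) or ("dma", channel) for a copy of nbytes."""
        # Small messages, or no DMA channels at all: use the kernel-assisted path.
        if nbytes < DMA_MIN_BYTES or not self.channels:
            return ("kernel", None)
        # Pick the least-loaded channel (ties go to the first one in the list).
        channel = min(self.channels, key=lambda c: c.pending_bytes)
        # If even that channel would exceed its cap, fall back to the kernel path.
        if channel.pending_bytes + nbytes > self.max_pending_bytes:
            return ("kernel", None)
        channel.pending_bytes += nbytes
        return ("dma", channel)

    def complete(self, channel, nbytes):
        """Record that a DMA copy of nbytes on this channel has finished."""
        channel.pending_bytes = max(0, channel.pending_bytes - nbytes)
```

With two channels, a 1 KiB message takes the kernel-assisted path in this sketch, while successive 1 MiB messages are spread across the channels until their backlog caps are reached.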

Contributors:

Format

on-site