OpenMP has a long and successful history in parallel programming for CPUs. Since the introduction of accelerator offloading, it has evolved into a promising candidate for all intra-node parallel computing needs. While this addition broke with the shared-memory assumption under which OpenMP was initially developed, efforts to employ OpenMP beyond shared-memory domains are practically non-existent.
In this work, we show that the OpenMP accelerator offloading model is sufficient to seamlessly and efficiently utilize more than a single compute node and its connected accelerators. Without source code or compiler modifications, we run an OpenMP offload-capable program on a remote CPU or remote accelerator (e.g., a GPU) as if it were a local one. For applications that support multi-device offloading, any combination of local and remote CPUs and accelerators can be utilized simultaneously and fully transparently to the user. Our low-overhead implementation of Remote OpenMP Offloading is integrated into the LLVM/OpenMP compiler infrastructure and is publicly available (in part) with LLVM 12 and later. LLVM-based (vendor) compilers are expected to be compatible out of the box as well.
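To illustrate what "multi-device offloading" refers to here, the following is a minimal sketch in C of an application that splits a loop across every device the OpenMP runtime reports. It is not taken from our implementation: the helper name `scale_on_all_devices` and the block-wise partitioning are hypothetical, and the sketch assumes that remote devices, when present, are simply enumerated alongside local ones by `omp_get_num_devices()`. For simplicity the devices are used one after another; adding the standard `nowait` clause and a closing `taskwait` is one way such a loop could be made to run the devices concurrently.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (illustration only): scale x[0..n) by a,
 * giving each device reported by the OpenMP runtime one contiguous block.
 * Without OpenMP support the pragma is ignored and the loop runs on the host. */
void scale_on_all_devices(double *x, size_t n, double a) {
#ifdef _OPENMP
    int ndev = omp_get_num_devices();   /* local (and, by assumption, remote) devices */
#else
    int ndev = 0;
#endif
    int parts = ndev > 0 ? ndev : 1;    /* with no devices, use a single block */
    size_t chunk = (n + (size_t)parts - 1) / (size_t)parts;

    for (int d = 0; d < parts; ++d) {
        size_t lo = (size_t)d * chunk;
        if (lo >= n) break;             /* more devices than elements */
        size_t len = (lo + chunk <= n) ? chunk : n - lo;
        double *p = x + lo;

        /* Map only this device's block, run the loop there, copy it back.
         * When ndev == 0, device number 0 denotes the host itself. */
        #pragma omp target teams distribute parallel for device(d) map(tofrom: p[0:len])
        for (size_t i = 0; i < len; ++i)
            p[i] *= a;
    }
}
```

The application code contains nothing specific to where a device lives; it only iterates over device numbers, which is why no source changes are needed when some of those devices are remote.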
To evaluate our work, we provide detailed studies on microbenchmarks as well as scaling results for two HPC proxy applications. We show perfect scaling across dozens of GPUs in multiple hosts, with effectiveness that is directly proportional to the ratio of computation time to memory transfer time. Our work outlines the capabilities and limits of OpenMP 5.1 to efficiently utilize a distributed heterogeneous system without source, compiler, or language modifications, as opposed to other solutions such as MPI.