Hans Meuer Award Winner: A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application

Monday, June 28, 2021 1:15 PM to 1:35 PM · 20 min. (Africa/Abidjan)
Award WinnerChallenges in Programming for Massive ScalePerformance Modeling & TuningProgramming Models & Languages



Performance portability is becoming increasingly important as next-generation high performance computing systems grow increasingly diverse and heterogeneous. Several new approaches to parallel programming have been developed in recent years to tackle this challenge, such as SYCL and Kokkos. While several studies have been published evaluating these new programming models, they have tended to focus on memory-bandwidth bound applications. In this paper we analyse the performance of the most promising modern parallel programming models, on a diverse range of contemporary high-performance hardware, using a compute-bound molecular docking mini-app. We present a mini-app for BUDE, the Bristol University Docking Engine, am application routinely used for drug discovery. We implement the mini-app in different programming models targeting both CPUs and GPUs, including SYCL and Kokkos. We then present an analysis of the performance of each implementation and compare them to highly-optimised baselines set using established programming models such as OpenMP, OpenCL, and CUDA. Our study includes a wide variety of modern hardware platforms covering CPUs based on x86 and Arm architectures, as well as GPUs. We found that, with the emerging higher-level parallel programming models framework such as SYCL, we could achieve performance comparable to that of the established models without hurting either portability or productivity. We identify a set of key challenges and pitfalls to take into account when adopting these emerging programming models, some of which are implementation-specific effects and not fundamental design errors that prevent further adoption. Finally, we discuss our findings in the wider context of performance-portable compute-bound workloads.