SU3_Bench on the Programmable Integrated Unified Memory Architecture (PIUMA) and How That Differs from Standard NUMA CPUs
Wednesday, June 1, 2022 9:40 AM to 10:00 AM · 20 min. (Europe/Berlin)
Hall G1 - 2nd Floor
HPC Workflows
Information
SU3_Bench explores performance portability across multiple programming models using a simple but nontrivial mathematical kernel. This kernel has been derived from the Lattice Quantum Chromodynamics (LQCD) code used in applications such as hadron physics and hence should be of interest to the scientific community.
SU3_Bench has a regular compute and data-access pattern, and on most traditional CPU- and GPU-based systems its performance is determined mainly by the achievable memory bandwidth. However, this paper shows that on the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is designed for sparse workloads and pairs scalar cores with a balanced flops-to-byte ratio, SU3_Bench's performance is determined by the total number of instructions that can be executed per cycle (pipeline throughput) rather than by the usual bandwidth or flops. We present the performance analysis, porting, and optimization of SU3_Bench on the PIUMA architecture and discuss how these differ from standard NUMA CPUs (e.g., Xeon required NUMA optimizations, whereas on PIUMA they were not necessary). We show iso-bandwidth and iso-power comparisons of SU3_Bench on PIUMA vs. Xeon, and performance-efficiency comparisons of SU3_Bench on PIUMA, Xeon, GPUs, and FPGAs based on pre-existing data. The lessons learned generalize to other similar kernels.
Contributors:
- Fabio Checconi (Intel Corporation)
- Douglas Doerfler (Lawrence Berkeley National Laboratory)
- Fabrizio Petrini (Intel Corporation)
- Jesmin Jahan Tithi (Intel Corporation)
Format
On-site / Live-Online