Auto-TBC: Auto-Optimizing CUDA Applications Using Thread Block Clusters

Monday, May 13, 2024 3:00 PM to Wednesday, May 15, 2024 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Research Poster
Compiler and Tools for Parallel Programming · Heterogeneous System Architectures · Memory Technologies and Hierarchies · Performance and Resource Modeling · Runtime Systems for HPC

Information

Poster is on display and will be presented at the poster pitch session.
A GPU kernel is launched using a grid of threads, organized into thread blocks. Threads in a thread block share data through the shared memory of a Streaming Multiprocessor (SM). As the number of SMs grew severalfold over the years, the thread block as the only unit for expressing data locality was no longer deemed sufficient. Therefore, in the Hopper architecture, NVIDIA introduced a new optional level of hierarchy in the CUDA programming model called thread block clusters (TBC). All thread blocks in a cluster are guaranteed to be scheduled concurrently on a group of SMs called a Graphics Processing Cluster (GPC). Thread blocks in a cluster can access the shared memory of any SM in the cluster over an SM-to-SM network. As TBC is a recent addition to the CUDA programming model, its performance benefits are unknown. Therefore, we optimize several benchmarks using TBC and evaluate the possible performance gain. Our results show a speedup of up to 3.6×, with an average improvement of 1.67× across the benchmarks. We also investigate the impact of different SM-to-SM networks on performance. While TBC can provide a performance boost, modifying benchmarks to use it requires significant effort. To address this, we develop a source-to-source transformation tool (Auto-TBC) that automatically optimizes benchmarks using TBC. The automatically optimized benchmarks achieve a speedup of up to 1.9× and an average speedup of 1.1×, offering a more accessible path to harnessing the benefits of TBC for a wide range of benchmarks.
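To make the cluster mechanism described above concrete, the following is a minimal sketch of a kernel that uses a thread block cluster and reads another block's shared memory through the CUDA cooperative groups cluster API. It is not taken from the poster and does not represent Auto-TBC's output. The kernel name `cluster_sum`, the helper `cluster_sums_of_ones`, the cluster size of 2, and the block size of 128 are hypothetical choices for illustration. It assumes a device of compute capability 9.0 and a build along the lines of `nvcc -arch=sm_90`.

```cuda
// Illustrative sketch only; names and sizes are assumptions, not from the poster.
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kClusterSize = 2;       // thread blocks per cluster (assumed)
constexpr int kThreadsPerBlock = 128; // threads per block (assumed)

// Each block sums its slice into its own shared memory. Block 0 of every
// cluster then reads the partial sums of all blocks in the cluster through
// distributed shared memory and writes one total per cluster.
__global__ void __cluster_dims__(kClusterSize, 1, 1)
cluster_sum(const int* in, int* out)
{
    __shared__ int partial;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) partial = 0;
    __syncthreads();
    atomicAdd(&partial, in[blockIdx.x * blockDim.x + threadIdx.x]);

    // Cluster-wide barrier: all partial sums are complete and visible.
    cluster.sync();

    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        int total = 0;
        for (unsigned int r = 0; r < cluster.num_blocks(); ++r) {
            // Map this block's shared variable to the copy owned by block r.
            const int* remote = cluster.map_shared_rank(&partial, r);
            total += *remote;
        }
        out[blockIdx.x / kClusterSize] = total;
    }

    // Keep every block resident until the remote reads have finished.
    cluster.sync();
}

// Host helper: sums an all-ones input, returning one total per cluster.
// Returns an empty vector if allocation or the launch fails, for example
// on a device without cluster support.
std::vector<int> cluster_sums_of_ones(int num_clusters)
{
    const int blocks = num_clusters * kClusterSize; // grid must be a multiple of the cluster size
    const int n = blocks * kThreadsPerBlock;
    std::vector<int> host_in(n, 1), host_out(num_clusters, 0);

    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return {};
    if (cudaMalloc(&d_out, num_clusters * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return {};
    }

    cudaError_t err = cudaMemcpy(d_in, host_in.data(), n * sizeof(int),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        cluster_sum<<<blocks, kThreadsPerBlock>>>(d_in, d_out);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(host_out.data(), d_out, num_clusters * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return {};
    return host_out;
}
```

Calling `cluster_sums_of_ones(2)` launches four blocks grouped into two clusters. With an all-ones input, each returned entry equals `kClusterSize * kThreadsPerBlock`, that is 256. The second `cluster.sync()` matters: without it, a block could exit while another block in the cluster is still reading its shared memory.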
Contributors:
Format
On-site