The Best Thread Block Size and Other Parameters You Have to Tune for Optimal Performance on GPUs


Optimal performance on GPUs often requires careful selection of tuning parameters such as the thread block size. This parameter space can be especially large when code generation is used. A common approach to finding the best configuration is auto-tuning. However, the generate/compile/benchmark procedure makes an exhaustive exploration of the large decision space slow. We propose using a performance model that can be evaluated quickly, instead of actually measuring the execution of the program, to speed up the selection process. To ensure evaluation speed, we do not require invoking the compiler or even generating the source code. Instead, the model is built only on high-level code features. We show how to compute intermediate hardware metrics, such as DRAM or cache balances, from these high-level features, and that feeding these metrics into a simple performance model in the style of the Roofline model can distinguish badly performing from well performing configurations with a high degree of certainty. The model also selects a best-performing configuration that achieves 96% of the performance of the actual best configuration.
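The Roofline-style selection described in the abstract can be sketched as follows. This is a minimal illustration only: the peak throughput numbers, configuration fields, and helper names are assumptions for the example, not values or APIs from the paper.

```python
# Minimal Roofline-style estimate (all numbers and field names are illustrative).
def roofline_time(flops, dram_bytes, peak_flops, dram_bw):
    """Predicted kernel time: the max of the compute-bound and memory-bound limits."""
    return max(flops / peak_flops, dram_bytes / dram_bw)

def rank_configs(configs, peak_flops, dram_bw):
    """Rank candidate tuning configurations by predicted time.

    No compilation or benchmarking is needed: the per-configuration flop count
    and DRAM traffic are assumed to come from high-level code features.
    """
    return sorted(
        configs,
        key=lambda c: roofline_time(c["flops"], c["dram_bytes"], peak_flops, dram_bw),
    )

# Hypothetical candidate configurations (block size plus derived traffic estimates).
configs = [
    {"block": (256, 1), "flops": 1e9, "dram_bytes": 4e8},  # memory-bound
    {"block": (64, 4),  "flops": 1e9, "dram_bytes": 1e8},  # better cache reuse
]

# Example machine limits: 10 TFLOP/s peak compute, 1 TB/s DRAM bandwidth.
best = rank_configs(configs, peak_flops=10e12, dram_bw=1e12)[0]
print(best["block"])  # → (64, 4)
```

Because each evaluation is just arithmetic on precomputed metrics, the whole configuration space can be ranked far faster than a compile-and-benchmark loop would allow.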
