Democratizing domain-specific architectures for sparse matrix multiplication
Monday, June 28, 2021 3:00 PM to 4:00 PM · 1 hr. (Africa/Abidjan)
Information
Sparse matrices, whether they represent the graphs of social science or the pruned layers of deep neural networks, are growing as fast as the number of researchers studying them. As these objects grow in size, so does the need to manipulate them efficiently on high-performance parallel systems. Sparse matrix multiplication is a fundamental building block of many such applications, but its optimization on parallel machines remains an open challenge.
Domain-specific parallel hardware for matrix multiplication, such as Nvidia tensor cores, is designed to multiply dense matrices and is ill suited to the irregular structure of sparse matrices. I am investigating the possibility of bridging this gap by reordering the rows of sparse matrices so that their non-zero entries cluster into smaller, denser blocks. In this way, elements inside blocks can be multiplied efficiently, while elements outside blocks are zero and do not contribute to the multiplication. More generally, my research aims to characterize and benchmark several implementations of sparse multiplication as a collection of block multiplications, to precisely identify the architectures and the matrices for which this approach can be effective.
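To make the idea concrete, below is a minimal sketch of row reordering and blocking, not the poster's actual algorithm: the block size and the first-non-zero sorting heuristic are illustrative assumptions chosen only to keep the example short.

```python
import numpy as np
import scipy.sparse as sp

def reorder_and_blockify(A, block_size=64):
    """Reorder the rows of a scipy CSR matrix so that similar sparsity
    patterns sit next to each other, then return the dense, non-empty
    block_size x block_size tiles of the reordered matrix."""
    n_rows, n_cols = A.shape

    # Crude clustering heuristic (illustrative only): sort rows by the
    # column index of their first non-zero, so rows touching the same
    # column range tend to fall into the same block row.
    first_nz = np.full(n_rows, n_cols)
    for i in range(n_rows):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if cols.size:
            first_nz[i] = cols.min()
    perm = np.argsort(first_nz, kind="stable")
    A_perm = A[perm, :]

    # Keep only the non-empty tiles as dense arrays; empty tiles are
    # dropped and never enter the multiplication.
    blocks = {}
    for bi in range(0, n_rows, block_size):
        for bj in range(0, n_cols, block_size):
            tile = A_perm[bi:bi + block_size, bj:bj + block_size]
            if tile.nnz > 0:
                blocks[(bi, bj)] = tile.toarray()
    return perm, blocks

# Tiny usage example on a random sparse matrix.
A = sp.random(512, 512, density=0.01, format="csr")
perm, blocks = reorder_and_blockify(A)
```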
Preliminary results suggest that a dense-block approach can be competitive for a variety of matrix structures, even when relying on high-level dense-dense multiplication libraries. In particular, the poster presents a comparison between the Nvidia cuSPARSE parallel multiplication routine and a block-based parallel routine that uses Nvidia cuBLAS to multiply the blocks. The comparison shows that this approach can already be effective on matrices with a density of less than 1%.
Of course, many solutions already exist that attempt to multiply sparse matrices efficiently on parallel architectures. These solutions either optimize the multiplication starting from an existing, popular data structure such as CSR, or define a custom data structure with a custom multiplication routine. Instead, I am searching for efficient solutions based on standard, dense-dense multiplication routines. This would make it possible to repurpose dense-specific architectures (e.g. tensor cores) for sparse multiplication, and also reduce the need for custom multiplication routines for sparse formats.
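Continuing the sketch above, the multiplication itself then reduces to one standard dense GEMM per non-empty tile. Here NumPy's @ operator stands in for a cuBLAS-style dense routine, and all names are illustrative rather than taken from the actual benchmark code.

```python
import numpy as np

def block_spmm(blocks, perm, B):
    """Compute C = A @ B, where A is represented by the dense non-empty
    tiles of its row-reordered form (see the previous sketch)."""
    n_rows = len(perm)
    C_perm = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for (bi, bj), tile in blocks.items():
        h, w = tile.shape
        # One standard dense GEMM per non-empty tile, accumulated into
        # the matching rows of the (permuted) result.
        C_perm[bi:bi + h] += tile @ B[bj:bj + w]
    # Undo the row permutation so C follows the original row order.
    C = np.empty_like(C_perm)
    C[perm] = C_perm
    return C

# Usage, continuing from the previous sketch:
B = np.random.rand(512, 8)
C = block_spmm(blocks, perm, B)
assert np.allclose(C, A @ B)  # agrees with scipy's own multiplication
```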