Multilevel Scheduling and Load Balancing in Scientific Applications
Thursday, July 1, 2021 1:15 PM to 1:35 PM · 20 min. (Africa/Abidjan)
Stream#1
Exascale Systems · Extreme Heterogeneity
Abstract:
Exascale computing systems will exhibit high degrees of hierarchical parallelism, with thousands of computing nodes and hundreds of (potentially heterogeneous) cores per node. Efficiently exploiting hierarchical parallelism is challenging due to load imbalance that arises at multiple levels: job, process, thread, task, and stream. In this talk, we will present a vision for multilevel scheduling and load balancing that lets scientific applications fully exploit the hierarchical parallelism available in HPC systems. We will show that thread-level and process-level load imbalance interact significantly, and that two-level dynamic scheduling and load balancing is needed to achieve the highest performance for applications from cosmology and astrophysics. We will also show how cooperation between the batch scheduler and the application scheduler increases system utilization and reduces job wait times without degrading application performance. We will conclude the talk by outlining the steps needed to integrate these results and deploy multilevel scheduling and load balancing in production HPC environments.