Asynchronous Many-Tasking (AMT): Load Balancing, Fault Tolerance, Resource Elasticity

Wednesday, June 1, 2022 1:00 PM to 1:04 PM · 4 min. (Europe/Berlin)

Hall D - 2nd Floor

Information

To enable efficient and productive programming of today's supercomputers and beyond, a variety of issues must be addressed, including: load balancing (i.e., utilizing all resources equally), fault tolerance (i.e., coping with hardware failures), and resource elasticity (i.e., allowing the addition/release of resources).

In this work, we address these issues in the context of Asynchronous Many-Tasking (AMT) for clusters. Here, programmers split a computation into many fine-grained execution units (called tasks), which a runtime system dynamically maps to processing units (called workers).

Regarding load balancing, we propose a work stealing technique that transparently schedules tasks to resources of the overall system, balancing the workload over all processing units. Experiments show good scalability, and a productivity evaluation indicates that the approach is intuitive to use.
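
As an illustration of the general idea of work stealing, and not of the technique proposed in this work, the following minimal Python sketch simulates workers in rounds. All names (`run_work_stealing`, `queues`, `executed`) are hypothetical. For reproducibility, an idle worker steals from the currently most loaded worker; real runtimes typically choose victims at random and run workers concurrently.

```python
from collections import deque


def run_work_stealing(num_workers, tasks):
    """Round-based simulation of work stealing (illustrative sketch only).

    Returns (tasks executed per worker, list of task results).
    """
    # All tasks start on worker 0, so the initial load is fully unbalanced.
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(tasks)
    executed = [0] * num_workers
    results = []

    while any(queues):  # an empty deque is falsy
        for w in range(num_workers):
            if queues[w]:
                # A busy worker takes the newest task from its own queue.
                task = queues[w].pop()
                results.append(task())
                executed[w] += 1
            else:
                # An idle worker steals the oldest task from the most
                # loaded worker (assumption: deterministic victim choice).
                victim = max(range(num_workers), key=lambda v: len(queues[v]))
                if queues[victim]:
                    queues[w].append(queues[victim].popleft())
    return executed, results


if __name__ == "__main__":
    tasks = [lambda i=i: i * i for i in range(100)]
    executed, results = run_work_stealing(4, tasks)
    print("tasks executed per worker:", executed)
```

Although every task starts on one worker, the stealing step spreads the work so that all workers end up executing tasks, without the programmer assigning tasks to workers explicitly.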

Regarding fault tolerance, we propose four techniques to protect programs transparently. All perform localized recovery and continue the program execution with fewer resources. Three techniques write uncoordinated checkpoints of task descriptors to a resilient store. One technique does not write checkpoints, but exploits the task duplication that occurs naturally in work stealing. Experiments show a failure-free running time overhead below 1% and a recovery overhead below 0.5 seconds. Simulations of job set executions show that makespans can be reduced by up to 97%.
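
The checkpoint-based idea can be sketched roughly as follows. This is a simplified illustration under assumptions, not one of the four proposed techniques: the `Worker` class, the `checkpoint` and `recover` helpers, and the use of a plain dictionary as the resilient store are all hypothetical. Each worker saves its open task descriptors and partial result on its own schedule (uncoordinated); when a worker fails, a survivor adopts the last saved state and the run continues with fewer workers.

```python
class Worker:
    def __init__(self, wid, tasks, interval):
        self.wid = wid
        self.pending = list(tasks)  # task descriptors (here: plain integers)
        self.partial = 0            # partial result (sum of squares)
        self.interval = interval    # steps between this worker's checkpoints
        self.alive = True

    def step(self):
        if self.pending:
            x = self.pending.pop()
            self.partial += x * x


def checkpoint(store, worker):
    # Save a consistent snapshot: open tasks plus the matching partial result.
    store[worker.wid] = (list(worker.pending), worker.partial)


def recover(store, failed_id, survivor):
    # Localized recovery: only the survivor adopts the failed worker's state.
    pending, partial = store.pop(failed_id)
    survivor.pending.extend(pending)
    survivor.partial += partial
    checkpoint(store, survivor)  # protect the newly adopted state


def run(tasks_a, tasks_b, fail_after):
    store = {}  # stand-in for a resilient store (assumption)
    a = Worker(0, tasks_a, interval=3)
    b = Worker(1, tasks_b, interval=4)
    checkpoint(store, a)
    checkpoint(store, b)
    steps = 0
    while a.pending or (b.alive and b.pending):
        steps += 1
        for w in (a, b):
            if w.alive:
                w.step()
                if steps % w.interval == 0:
                    checkpoint(store, w)  # no coordination between workers
        if b.alive and steps == fail_after:
            b.alive = False  # work done by b since its last checkpoint is lost
            recover(store, b.wid, a)
    return a.partial + (b.partial if b.alive else 0)


if __name__ == "__main__":
    print(run(range(1, 11), range(11, 21), fail_after=5))
```

Work that the failed worker completed after its last checkpoint is recomputed by the survivor, so the final result matches that of a failure-free run.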

Regarding resource elasticity, we propose a technique to enable the addition and release of nodes at runtime by transparently relocating tasks accordingly. Experiments show costs for adding and releasing nodes below 0.5 seconds. Additionally, simulations of job set executions show that makespans can be reduced by up to 20%.
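
To make the notion of relocating tasks concrete, here is a minimal Python sketch under assumptions. The `ElasticPool` class and its methods are hypothetical and do not reflect the proposed runtime: adding a node moves tasks from the most loaded nodes to the new one, and releasing a node redistributes its tasks to the remaining nodes.

```python
class ElasticPool:
    """Toy model of nodes that hold task lists (illustrative sketch only)."""

    def __init__(self, num_nodes, tasks):
        self.nodes = {i: [] for i in range(num_nodes)}
        self.next_id = num_nodes
        # Initial round-robin distribution of tasks over the nodes.
        for i, task in enumerate(tasks):
            self.nodes[i % num_nodes].append(task)

    def total_tasks(self):
        return sum(len(q) for q in self.nodes.values())

    def add_node(self):
        new_id = self.next_id
        self.next_id += 1
        self.nodes[new_id] = []
        # Relocate tasks from the most loaded nodes until the new node
        # holds a fair share (rounded down).
        fair = self.total_tasks() // len(self.nodes)
        while len(self.nodes[new_id]) < fair:
            donor = max(self.nodes, key=lambda n: len(self.nodes[n]))
            self.nodes[new_id].append(self.nodes[donor].pop())
        return new_id

    def release_node(self, node_id):
        if len(self.nodes) == 1:
            raise ValueError("cannot release the last node")
        # Relocate every task of the released node to the least loaded node.
        for task in self.nodes.pop(node_id):
            target = min(self.nodes, key=lambda n: len(self.nodes[n]))
            self.nodes[target].append(task)


if __name__ == "__main__":
    pool = ElasticPool(2, range(10))
    pool.add_node()
    pool.release_node(0)
    print({n: len(q) for n, q in pool.nodes.items()})
```

No task is lost or duplicated by either operation, which is the property that lets the node count change while a program is running.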

Contributors: