Project Wagomu: Elastic HPC Resource Management

Monday, May 13, 2024 3:00 PM to Wednesday, May 15, 2024 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Project Poster
Parallel Programming Languages · Resource Management and Scheduling · Runtime Systems for HPC

Information

Poster is on display.
Traditionally, hardware resource management in supercomputers has been static: jobs request a predetermined set of resources and retain them for the duration of their execution. Elastic resource management introduces dynamic control over resources, which can significantly improve the efficiency of supercomputers. This approach requires support from both the resource manager and the program so that the two can interact seamlessly. Running jobs can then grow or shrink their resource allocation at runtime, with each change initiated by either party. Project Wagomu envisions a novel, user-friendly, holistic approach that addresses both sides, including their interaction.

On the program side, we develop resource elasticity techniques within a prototypical Asynchronous Many-Task (AMT) runtime system. These techniques operate automatically and transparently, i.e., no user code changes are required. AMT decomposes problems into small, independent execution units, or tasks, which are dynamically assigned to processors. Tasks can generate new tasks at runtime, accommodate irregular computation patterns, and access global data. AMT applications inherently adapt to resource variations initiated by the resource manager, can autonomously initiate their own resource adjustments, and can be structured into distinct phases to adapt to fluctuating resource availability.

On the resource management side, we develop novel elastic job scheduling algorithms. After evaluation through simulation, these algorithms will be integrated into a prototype job scheduler. This scheduler is designed to take into account various factors, including non-computational resources such as I/O bandwidth. It will enable elastic job scheduling, initiate and manage resource adjustments itself, and handle resource adjustments initiated by jobs.
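To make the idea of elastic job scheduling concrete, the following is a minimal sketch and not the project's algorithm. It assumes that each job declares a minimum and a maximum node count (the names `Job`, `min_nodes`, `max_nodes`, and `rebalance` are hypothetical), and that the scheduler may freely shrink or grow running jobs within those bounds whenever the job mix changes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description: all fields are illustrative assumptions.
    name: str
    min_nodes: int   # smallest allocation the job can run with
    max_nodes: int   # largest allocation the job can make use of
    nodes: int = 0   # current allocation, set by rebalance()


def rebalance(jobs, total_nodes):
    """Recompute all allocations; return the number of nodes left idle.

    Step 1: admit jobs in list order, giving each its minimum if it fits.
    Step 2: hand out spare nodes one at a time, round-robin, up to each
            admitted job's maximum.
    """
    free = total_nodes
    admitted = []
    for job in jobs:
        if job.min_nodes <= free:
            job.nodes = job.min_nodes
            free -= job.min_nodes
            admitted.append(job)
        else:
            job.nodes = 0  # job has to wait; it holds no nodes

    progress = True
    while free > 0 and progress:
        progress = False
        for job in admitted:
            if free > 0 and job.nodes < job.max_nodes:
                job.nodes += 1
                free -= 1
                progress = True
    return free
```

On a hypothetical 10-node machine, jobs A (2 to 8 nodes) and B (exactly 4 nodes) receive 6 and 4 nodes. When a job C (3 to 6 nodes) arrives and `rebalance` is called again, A shrinks to 3 nodes so that C can start with 3, while B keeps its 4. A static scheduler would instead leave C queued until A or B finished.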
Format
On-site