Building a Highly Available SLURM Runtime with Scheduler Tuning for CI-Driven HPC Workloads

Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Project Poster
Resource Management and Scheduling · Runtime Systems for HPC

Information

Poster is on display.
HPC clusters today are used for more than traditional scientific workloads. In addition to long-running MPI jobs, many systems now support automated CI pipelines and other short-lived tasks. These newer workloads place different demands on the scheduler and control plane. While SLURM remains widely used in HPC environments, its default setup is not designed for this mixed usage or for strong fault tolerance.


This poster presents an ongoing effort to make SLURM more reliable and better suited for modern HPC environments. The work focuses on building a highly available SLURM control plane that can continue operating even when a controller fails. Redundant controllers and shared state are used to reduce downtime and avoid single points of failure. Automation keeps the system consistent and makes day-to-day operation easier.
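
The poster does not publish its configuration, so the following is only a minimal sketch of how redundant controllers and shared state could be expressed in slurm.conf and generated by automation. `SlurmctldHost`, `StateSaveLocation` and `SlurmctldTimeout` are standard slurm.conf parameters; the hostnames, the shared path, the timeout value and the `ControlPlane` / `render_slurm_conf` names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControlPlane:
    # Hypothetical hostnames and path; the poster does not name them.
    controllers: list[str]
    state_dir: str
    takeover_timeout_s: int = 120  # assumed value for SlurmctldTimeout


def render_slurm_conf(cp: ControlPlane) -> str:
    """Render only the high-availability lines of a slurm.conf file.

    Slurm treats the first SlurmctldHost entry as the primary controller
    and later entries as backups, in the order listed.
    """
    if len(cp.controllers) < 2:
        raise ValueError("need at least two controllers for failover")
    lines = [f"SlurmctldHost={host}" for host in cp.controllers]
    # The state directory must be readable and writable by every controller,
    # for example on a shared file system.
    lines.append(f"StateSaveLocation={cp.state_dir}")
    lines.append(f"SlurmctldTimeout={cp.takeover_timeout_s}")
    return "\n".join(lines)


if __name__ == "__main__":
    plane = ControlPlane(["ctl1", "ctl2"], "/shared/slurm/state")
    print(render_slurm_conf(plane))
```

Generating these lines from one description of the control plane is one way automation can keep every controller on an identical configuration; the actual tooling used in this work may differ.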


Alongside reliability improvements, this work explores simple scheduler tuning approaches to better support both CI and traditional HPC workloads. CI jobs are usually short and arrive in bursts, while HPC jobs tend to run longer and use more resources. To handle this difference, the scheduler is tuned so short CI jobs can run quickly when resources are available, without delaying larger jobs that have already been planned.
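
The poster does not name the tuning mechanism it uses. As a hedged illustration, the rule "run short jobs when resources are available, without delaying planned jobs" matches the general idea of backfill scheduling, which Slurm's `sched/backfill` plugin applies using job time limits. The sketch below is a simplified model of that admission rule, not the authors' implementation; `Job`, `can_start_now` and all parameters are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int      # number of nodes requested
    walltime: int   # requested time limit, in minutes


def can_start_now(job: Job, now: int, free_nodes: int,
                  reserved_start: int, reserved_nodes: int) -> bool:
    """Decide whether a short job may start without delaying a planned job.

    reserved_start: minute at which the planned (larger) job is due to start.
    reserved_nodes: how many of the currently free nodes that job will need.
    """
    if job.nodes > free_nodes:
        return False  # not enough idle nodes right now
    # Case 1: the short job ends before the planned job needs the nodes.
    finishes_in_time = now + job.walltime <= reserved_start
    # Case 2: enough idle nodes remain for the planned job even while
    # the short job is still running.
    leaves_room = free_nodes - job.nodes >= reserved_nodes
    return finishes_in_time or leaves_room


if __name__ == "__main__":
    ci_job = Job(nodes=2, walltime=10)
    print(can_start_now(ci_job, now=0, free_nodes=4,
                        reserved_start=30, reserved_nodes=4))
```

In this model a 10-minute CI job is admitted because it finishes before the planned job's start at minute 30, while a 60-minute job on the same nodes would be held back. A real scheduler tracks many reservations and resource types at once.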


The system described in this poster is still under active development. Rather than presenting performance results, the focus is on design choices and operational behavior. By combining a reliable control plane with workload-aware scheduling, this work aims to show a practical approach for running HPC clusters that must support both established scientific workloads and newer automation-driven tasks.
Format
on-demand · on-site