AresBench: Automated HPC Cluster Validation with Interactive Performance Analytics


Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Project Poster
High-Performance Data Analytics · HPC in the Cloud and HPC Containers · Industrial Use Cases of HPC, ML and QC · Performance Measurement · System and Performance Monitoring

Information

Poster is on display.
Maintaining performance consistency across large-scale GPU clusters requires systematic validation and intuitive analytics. We present AresBench, an integrated framework combining automated benchmarking with a purpose-built web dashboard for real-time performance monitoring of SLURM-managed HPC systems.
Built on ReFrame, AresBench automates benchmark execution using containerized workflows with NVIDIA Pyxis/Enroot integration. The framework implements HPL for compute validation and NCCL collective operations (AllReduce, AllToAll, AllGather, ReduceScatter) for GPU interconnect characterization, along with many other benchmarks such as NvBandwidth and Nemotron. It supports single-node through multi-node configurations across AWS, Azure, GCP, and OCI environments.
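As an illustration of how NCCL collective results are commonly compared across node counts, the sketch below converts a measured transfer into bus bandwidth using the correction factors documented for the public nccl-tests suite. The abstract does not say which metric AresBench reports, so the function name `bus_bandwidth_gbps` and its parameters are hypothetical and chosen only for this example.

```python
# Sketch only: bus-bandwidth conversion following the nccl-tests convention.
# The function and parameter names are assumptions, not part of AresBench.

# Correction factor applied to algorithm bandwidth, as a function of rank count n.
BUSBW_FACTORS = {
    "AllReduce": lambda n: 2 * (n - 1) / n,
    "AllGather": lambda n: (n - 1) / n,
    "ReduceScatter": lambda n: (n - 1) / n,
    "AllToAll": lambda n: (n - 1) / n,
}


def bus_bandwidth_gbps(op: str, size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Return bus bandwidth in GB/s for one collective measurement."""
    if op not in BUSBW_FACTORS:
        raise ValueError(f"unsupported collective: {op}")
    if time_s <= 0 or n_ranks < 2:
        raise ValueError("time_s must be positive and n_ranks at least 2")
    # Algorithm bandwidth: payload size divided by elapsed time.
    alg_bw = size_bytes / time_s / 1e9
    return alg_bw * BUSBW_FACTORS[op](n_ranks)


if __name__ == "__main__":
    # 1 GB AllReduce completed in 10 ms on 8 ranks.
    print(bus_bandwidth_gbps("AllReduce", 1e9, 0.01, 8))
```

Because the factor removes the dependence on rank count, bus bandwidth from a single-node run and a multi-node run can be placed on the same chart, which is one reason it is a common choice for interconnect characterization.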
The AresBench UI provides an interactive React-based dashboard featuring: cluster management with real-time status, configurable NCCL operation selection, 30-day performance trend visualization with multiple chart types, and baseline comparison with automatic ±5% regression detection. A detailed results table displays job metadata, CUDA/NCCL/driver versions, and node configurations with date filtering and pagination.
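The ±5% baseline comparison described above can be sketched as follows. This is a minimal illustration, not the AresBench implementation: it assumes a higher-is-better metric (such as bandwidth or GFLOPS), and the names `Comparison` and `compare_to_baseline` are hypothetical.

```python
# Sketch only: flag a result that falls outside a ±5% band around a baseline.
# Names and the higher-is-better assumption are illustrative, not from AresBench.
from dataclasses import dataclass

TOLERANCE = 0.05  # the ±5% band mentioned in the abstract


@dataclass
class Comparison:
    metric: str
    baseline: float
    measured: float
    deviation: float  # signed relative deviation from the baseline
    status: str       # "regression", "improvement", or "within_tolerance"


def compare_to_baseline(metric: str, baseline: float, measured: float,
                        tolerance: float = TOLERANCE) -> Comparison:
    """Classify a measurement relative to its baseline (higher is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    deviation = (measured - baseline) / baseline
    if deviation < -tolerance:
        status = "regression"
    elif deviation > tolerance:
        status = "improvement"
    else:
        status = "within_tolerance"
    return Comparison(metric, baseline, measured, deviation, status)


if __name__ == "__main__":
    result = compare_to_baseline("allreduce_busbw_gbps", 100.0, 94.0)
    print(result.status, f"{result.deviation:+.1%}")
```

For lower-is-better metrics such as latency, the sign of the comparison would need to be inverted; a per-metric direction flag is one way to handle that.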
End-to-end CI/CD integration via GitLab pipelines enables on-demand validation with secure Vault credential management. Test results persist to MongoDB/Cosmos DB, bridging automated testing with human-centered analytics.
The architecture separates ReFrame-based test definitions, a Node.js/Express backend with REST APIs, and a Kubernetes-deployed React frontend, so that each component can scale independently while data remains consistent.
AresBench reduces mean-time-to-detection for infrastructure issues while providing stakeholders with accessible performance insights, supporting new cluster validation, post-update regression detection, and rack-level SNIFF testing workflows.
Keywords: HPC benchmarking, ReFrame, NCCL, GPU clusters, performance visualization, CI/CD automation
Contributors:
Format
on-demand, on-site
