Intelligent Job-Centric Monitoring Framework: Accelerating HPC Troubleshooting with AI-Powered Insights

Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Women in HPC Poster
Application Workflows for Discovery · High-Performance Data Analytics · Industrial Use Cases of HPC, ML and QC · Large Language Models and Generative AI in HPC · System and Performance Monitoring

Information

The poster is on display and will be presented at the poster pitch session.
High performance computing (HPC) systems at exascale generate an enormous volume of telemetry spanning the state of network interconnects, node-level metrics, system events, and workload manager data. However, the traditional approach of monitoring these data streams in isolation often impedes the ability of researchers and administrators to rapidly diagnose and resolve performance bottlenecks, resulting in lost productivity and underutilized resources. To address this challenge, our work introduces a pioneering job-centric monitoring framework that unifies and correlates system-wide telemetry with workload data, providing actionable insights into the interactions between applications and the underlying HPC infrastructure. This poster showcases the achievements of a women-led team that transformed user pain points into a focused, practical solution, empowering users to troubleshoot efficiently.

Conventional monitoring frameworks typically require operators to manually cross-reference disparate logs and metrics, a time-consuming and error-prone process that complicates root cause analysis. Our framework replaces this manual workflow by automatically aligning a rich variety of interconnect telemetry, spanning nearly 400 distinct metrics and events, with individual job execution windows and resource allocations. By mapping fabric-level telemetry directly to specific workloads, the framework isolates crucial information from the vast raw data, surfacing only the telemetry most relevant to each job's resource footprint during runtime. This dual-axis correlation, temporal (aligned to the job's execution window) and spatial (filtered by its allocated infrastructure), enables users to observe, in real time, how factors such as network congestion, hardware anomalies, or competing workloads directly impact their applications.
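The dual-axis correlation described above can be sketched in a few lines. This is a minimal illustration, not the framework's implementation: the `Sample` record, its field names, and the node identifiers are all hypothetical stand-ins for whatever schema the real telemetry pipeline uses.

```python
from dataclasses import dataclass

# Hypothetical telemetry record; field names are illustrative only.
@dataclass
class Sample:
    source: str    # node or switch port that emitted the metric
    t: float       # timestamp in seconds
    metric: str    # e.g. a fabric congestion counter
    value: float

def job_view(samples, job_nodes, t_start, t_end):
    """Dual-axis filter: keep only samples inside the job's execution
    window (temporal axis) and on its allocated nodes (spatial axis)."""
    return [s for s in samples
            if t_start <= s.t <= t_end and s.source in job_nodes]

# Example: a job allocated node01 and node02, running from t=100 to t=200.
telemetry = [
    Sample("node01", 150.0, "congestion", 0.7),  # in window, allocated
    Sample("node07", 150.0, "congestion", 0.1),  # in window, not allocated
    Sample("node01", 250.0, "congestion", 0.0),  # allocated, outside window
]
relevant = job_view(telemetry, {"node01", "node02"}, 100.0, 200.0)
```

Only the first sample survives both filters, which is exactly the property that lets the framework surface per-job telemetry from a system-wide stream.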

Currently, our team is integrating AI-powered root cause analysis into the framework. By leveraging large language models (LLMs) augmented with system-specific context and domain knowledge, the platform will transform distributed telemetry and logs into concise, structured, and actionable intelligence. This approach is made feasible on commodity hardware through the use of quantized pre-trained models and efficient inference pipelines, ensuring scalability and accessibility for HPC centers.
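One way such an LLM pipeline could package job-scoped findings is sketched below. The function name, the prompt wording, and the example anomaly strings are assumptions for illustration; the actual model invocation (e.g. via a local quantized-inference runtime) is deliberately elided.

```python
def build_rca_prompt(job_id, anomalies, log_excerpts):
    """Assemble job-scoped telemetry anomalies and log excerpts into a
    structured prompt for a root-cause-analysis LLM. The call into a
    quantized model is omitted here; any local inference runtime could
    consume the returned string."""
    context = "\n".join(f"- {a}" for a in anomalies)
    logs = "\n".join(f"- {line}" for line in log_excerpts)
    return (
        "You are an HPC support assistant.\n"
        f"Job: {job_id}\n"
        "Anomalies observed during the job's execution window:\n"
        f"{context}\n"
        "Relevant log excerpts:\n"
        f"{logs}\n"
        "Identify the most likely root cause and suggest a remediation."
    )

# Hypothetical inputs, as the framework's correlation stage might emit them.
prompt = build_rca_prompt(
    "4217",
    ["sustained transmit-wait on the job's allocated switch ports",
     "rising correctable memory errors on one allocated node"],
    ["slurmd: job 4217 step 0 exceeded its time limit"],
)
```

Grounding the prompt in only the job's own telemetry keeps the context small, which is what makes inference with quantized pre-trained models practical on commodity hardware.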

In summary, our work bridges the longstanding gap between system behavior and workload performance in exascale HPC environments. By delivering robust, data-driven insights in an accessible format, this initiative enables users to make more effective use of HPC environments, spending less time deciphering infrastructure complexity and more time advancing the frontiers of scientific discovery. The framework sets a new standard for proactive, intelligent performance management in HPC, while supporting women in leadership, the research community, and productivity in computational science.
Format: on-demand · on-site
