

A Dynamic GPU Power Prediction Framework for Production HPC Applications
Thursday, June 25, 2026 9:00 AM to 9:20 AM · 20 min. (Europe/Berlin)
Hall E - 2nd Floor
Research Paper
Energy Efficiency and Sustainability
ML Systems and Frameworks
Information
As exascale systems increasingly rely on hardware overprovisioning to sustain throughput under strict power limits, accurate, fine-grained power prediction becomes critical. We present a machine learning framework that uses lightweight, always-on NVIDIA DCGM telemetry from the Perlmutter supercomputer to predict runtime GPU power consumption, designed with real-time deployment constraints in mind. Using a month-long production telemetry dataset for training and evaluation, we demonstrate that a generic, application-agnostic model achieves an average Root Mean Squared Error (RMSE) of 50 W. Specialized application-aware models for dominant workloads (e.g., Chroma, VASP) reduce prediction error by up to 48% compared to generic models. Combining DCGM metrics with historical power data improves accuracy by up to 15.8% over power-only models, with negligible inference latency. Our framework provides a scalable building block for dynamic power management systems that allow hardware-overprovisioned systems to reclaim performance otherwise lost to static power caps.
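The abstract reports accuracy as RMSE in watts and as relative error reductions between models. As a minimal sketch of how those two quantities are typically computed (not the authors' evaluation code), the Python below defines both. The function names `rmse` and `relative_reduction` and all sample power values are hypothetical, chosen only for illustration; they are not data from the paper.

```python
import math


def rmse(actual_w, predicted_w):
    """Root Mean Squared Error between measured and predicted power, in watts."""
    if len(actual_w) != len(predicted_w) or not actual_w:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual_w, predicted_w)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_error, improved_error):
    """Fraction by which improved_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical GPU power samples in watts (illustrative only, not paper data).
measured = [300.0, 320.0, 280.0, 350.0]
generic_pred = [310.0, 300.0, 290.0, 330.0]      # stand-in for a generic model
specialized_pred = [305.0, 310.0, 285.0, 340.0]  # stand-in for an application-aware model

generic_rmse = rmse(measured, generic_pred)
specialized_rmse = rmse(measured, specialized_pred)
reduction = relative_reduction(generic_rmse, specialized_rmse)

print(f"generic RMSE:     {generic_rmse:.2f} W")
print(f"specialized RMSE: {specialized_rmse:.2f} W")
print(f"error reduction:  {reduction:.1%}")
```

Because RMSE squares each error before averaging, it penalizes large mispredictions more heavily than a mean absolute error would.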
Contributors:
Format
on-site
Documents & Links
Read the Full Paper Open Access at IEEE Xplore!


