Automatic Detection of Pathological Jobs for HPC User Support (PathoJobs)

Monday, May 13, 2024 3:00 PM to Wednesday, May 15, 2024 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Project Poster
Performance Measurement
System and Performance Monitoring

Information

Poster is on display.
The "Automatic Detection of Pathological Jobs for HPC User Support" (PathoJobs) project is developing an automated, rule-based detection system for pathological HPC jobs. Examples of such pathologies include over-parallelization, software pipelines with a decreasing degree of parallelism, and a lack of process binding. Detection is defined by parametrizable rules, which HPC experts can extend and refine. The system supports “action templates” that describe countermeasures for a specific category of pathological job. It also supports automated actions, such as notifying HPC users directly and forwarding completed action templates to help mitigate the detected issue. The key goal is production use of the system at German NHR facilities; it is currently being prototyped and tested at contributing NHR sites.
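To make the idea of parametrizable rules paired with action templates concrete, here is a minimal Python sketch. It is not the PathoJobs implementation: the metric names (`cores`, `cpu_utilization`, `process_migrations_per_s`), the thresholds, the rule names, and the template wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, Dict, List

# Hypothetical per-job metrics; these field names are assumptions.
JobMetrics = Dict[str, float]
Params = Dict[str, float]


@dataclass
class Rule:
    """A detection rule: a predicate, its tunable parameters, and a countermeasure text."""
    name: str
    predicate: Callable[[JobMetrics, Params], bool]
    params: Params
    action_template: Template

    def matches(self, job: JobMetrics) -> bool:
        return self.predicate(job, self.params)


def over_parallelized(job: JobMetrics, p: Params) -> bool:
    # Many cores allocated, but average CPU utilization stays low.
    return job["cores"] >= p["min_cores"] and job["cpu_utilization"] < p["max_utilization"]


def unbound_processes(job: JobMetrics, p: Params) -> bool:
    # Frequent migrations between cores hint at missing process binding.
    return job["process_migrations_per_s"] > p["max_migrations_per_s"]


# Thresholds are illustrative; an HPC expert would tune them per site.
RULES: List[Rule] = [
    Rule(
        name="over-parallelization",
        predicate=over_parallelized,
        params={"min_cores": 16, "max_utilization": 0.5},
        action_template=Template(
            "Job $job_id used $cores cores at utilization $cpu_utilization. "
            "Consider requesting fewer cores."
        ),
    ),
    Rule(
        name="missing process binding",
        predicate=unbound_processes,
        params={"max_migrations_per_s": 10.0},
        action_template=Template(
            "Job $job_id shows $process_migrations_per_s process migrations per second. "
            "Consider pinning processes to cores."
        ),
    ),
]


def detect(job_id: str, job: JobMetrics, rules: List[Rule]) -> List[str]:
    """Return one completed action template for every rule the job triggers."""
    return [
        rule.action_template.substitute(job_id=job_id, **job)
        for rule in rules
        if rule.matches(job)
    ]
```

Calling `detect("123", {"cores": 64, "cpu_utilization": 0.2, "process_migrations_per_s": 0.0}, RULES)` returns a single completed template for the over-parallelization rule, which could then be sent to the user or forwarded to support staff. Keeping the thresholds in `params` rather than in the predicate is what makes each rule parametrizable without code changes.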
Contributors:
Format
On-site