MAWA-HPC: Modular and Automated Workload Analysis for HPC Systems

Monday, May 22, 2023 3:00 PM to Wednesday, May 24, 2023 5:00 PM · 2 days 2 hr. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
Extreme Heterogeneity · HPC Workflows · Memory and Storage Technology · Performance Modeling and Tuning · Resource Disaggregation

Information

Given the complexity of modern HPC systems, achieving theoretical peak performance depends on a myriad of parameters and system configurations. To optimize system performance and use the underlying resources efficiently, various methods can be applied, including simulation, benchmarking, and monitoring. However, these methods and their tools are not compatible with each other and consider only a selection of performance factors, such as network, I/O, resource allocation, or parallel execution. Yet each of these approaches generates knowledge that can be applied to similar problems or system configurations. To prevent such knowledge from being collected only for one-time purposes, and to support other users, it must be easily accessible and available to the community. The MAWA-HPC (Modular and Automated Workload Analysis for HPC Systems) project aims to develop a generic workflow and tool suite that can be applied to different use cases and workloads from different science domains. Through its modular design, the workflow can support various community tools, increasing the compatibility of each tool and covering new use cases. In this poster, we present the high-level system design of MAWA-HPC and its current prototype implementation. By extending the prototype with support for additional monitoring and profiling tools, node-level performance engineering tools, network benchmarks, and microbenchmarks for different parallel programming models, we also introduce a multi-dimensional Roofline model. By including time as a third dimension, this Roofline model can provide insight into an application's performance over time, enabling the identification and understanding of performance anomalies.
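To make the idea of a time-resolved Roofline model more concrete, the following is a minimal sketch and not the MAWA-HPC implementation. It assumes the classic Roofline bound, min(peak performance, arithmetic intensity × memory bandwidth), evaluated once per monitoring interval. The `Sample` record, the function names, the machine parameters, and the anomaly threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring interval (hypothetical record layout, for illustration only)."""
    t: float            # interval start time in seconds
    flops: float        # floating-point operations executed in the interval
    bytes_moved: float  # bytes moved to/from main memory (assumed > 0)
    duration: float     # interval length in seconds (assumed > 0)


def roofline_bound(intensity, peak_gflops, bandwidth_gbs):
    """Classic Roofline ceiling in GFLOP/s: min(peak, intensity * bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def roofline_over_time(samples, peak_gflops, bandwidth_gbs):
    """Return one (t, intensity, achieved GFLOP/s, fraction of bound) tuple per sample."""
    points = []
    for s in samples:
        intensity = s.flops / s.bytes_moved    # FLOP per byte
        achieved = s.flops / s.duration / 1e9  # GFLOP/s
        bound = roofline_bound(intensity, peak_gflops, bandwidth_gbs)
        points.append((s.t, intensity, achieved, achieved / bound))
    return points


def find_anomalies(points, threshold=0.5):
    """Times of intervals whose achieved performance is below threshold * bound."""
    return [t for t, _, _, fraction in points if fraction < threshold]


# Hypothetical machine (100 GFLOP/s peak, 50 GB/s bandwidth) and made-up samples.
samples = [
    Sample(t=0.0, flops=80e9, bytes_moved=40e9, duration=1.0),
    Sample(t=1.0, flops=10e9, bytes_moved=40e9, duration=1.0),
    Sample(t=2.0, flops=10e9, bytes_moved=10e9, duration=1.0),
]
points = roofline_over_time(samples, peak_gflops=100.0, bandwidth_gbs=50.0)
anomalous_times = find_anomalies(points)
```

Each tuple is one point in a (time, arithmetic intensity, performance) space. Comparing achieved performance with the bound per interval, rather than once per run, is what allows a phase that falls far below its ceiling to be located in time.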
Contributors:
Format: On-site
Beginner Level: 30%
Intermediate Level: 40%
Advanced Level: 30%
