Guidelines for HPC data center monitoring and analytics framework development
Wednesday, June 30, 2021 12:00 PM to 12:35 PM · 35 min. (Africa/Abidjan)
Exascale Systems
Information
Abstract:
Improving the efficiency of HPC data centers and their supercomputers is driving HPC sites to develop monitoring and analytics frameworks that collect and analyze operational data at an ever-increasing scale. It appears that every center is developing its own solution, potentially reinventing the wheel over and over again. Although there is no one-size-fits-all solution, as a community we need to learn from each other. Only a limited number of technologies are in use for each function of a monitoring framework (for example, message passing relies mainly on three message brokers: Kafka, MQTT, and RabbitMQ), yet each site goes through its own detailed analysis of which technology to choose and why. We should pool these analyses to provide guidance and best practices, and see whether there are core technologies we can standardize on. This BoF will present five different implementations, highlighting the different aspects (components) of a holistic HPC data center monitoring and analytics framework and discussing specific design choices.
Contributors:
- Woong Shin (ORNL)
- Thomas Ilsche (TU Dresden, ZIH)
- Michael Ott (LRZ)
- Rachel Palumbo (ORNL)
- Melissa Romanus Abdelbaky (LBL)
- Torsten Wilde (HPE)
- Keiji Yamamoto (RIKEN)
Speakers
Torsten Wilde
Master System Architect, Hewlett Packard Enterprise, Energy Efficient HPC Working Group
Keiji Yamamoto
Unit Leader, RIKEN R-CCS
Thomas Ilsche
Scientific Assistant, Technische Universität Dresden
Michael Ott
Senior Researcher, Leibniz Supercomputing Centre
Rachel Palumbo
Data Analytic Engineer, Oak Ridge National Laboratory
Melissa Romanus
Data Management Engineer, Lawrence Berkeley National Laboratory, USA