The drive to improve the efficiency of HPC data centers and their supercomputers is leading HPC sites to develop monitoring and analytics frameworks that collect and analyze operational data at an ever-increasing scale. It appears that every center is developing its own solution, potentially reinventing the wheel over and over again. Although there is no one-size-fits-all solution, as a community we need to learn from each other. Only a limited number of technologies are in use for each function of a monitoring framework (for message passing, for example, three message brokers are mainly used: Kafka, MQTT, and RabbitMQ), yet each site carries out its own detailed analysis of which technology to choose and why. We should pool these analyses to provide guidance and best practices, and to see whether there are core technologies we can standardize on.
This BoF will present five different implementations, each highlighting different aspects (components) of a holistic HPC data center monitoring and analytics framework, and will discuss the specific design choices behind them.