Auto-tuning the MPI-IO Performance of Applications by using Predictive Modelling
Information
Parallel I/O is an essential part of scientific applications running on high performance computing systems. Typically, parallel I/O stacks offer many parameters which need to be tuned in order to achieve the best possible I/O performance. Unfortunately, there is no single best default configuration of parameters; in practice the best settings differ not only between systems, but often also from one application use-case to another. However, scientific users often have neither the time nor the experience to explore the parameter space sensibly and choose the right configuration for each application use-case. In this study, an auto-tuning approach based on predictive modelling is proposed that can find a good set of I/O parameter values for a given system and application use-case. The feasibility of auto-tuning parameters related to the Lustre file system and the MPI-IO ROMIO library transparently to the user is demonstrated. In particular, for a given I/O pattern, the model predicts the best configuration from a history of previous I/O runs. The model has been validated with two I/O benchmarks, namely IOR and MPI-Tile-IO, and a real molecular dynamics code, ls1 Mardyn.
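To illustrate the kind of parameters being tuned, the following is a minimal sketch of how Lustre striping and ROMIO collective-buffering settings can be passed to MPI-IO as hints via an MPI Info object (shown here with mpi4py). The hint keys striping_factor, striping_unit, cb_nodes and romio_cb_write are standard ROMIO/Lustre hints; the concrete values and the file name are purely illustrative and do not represent the configurations selected by the model in the paper.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Tunable I/O parameters, passed as MPI-IO hints (values are illustrative).
info = MPI.Info.Create()
info.Set("striping_factor", "16")        # Lustre stripe count
info.Set("striping_unit", "4194304")     # Lustre stripe size in bytes (4 MiB)
info.Set("cb_nodes", "8")                # ROMIO collective-buffering aggregators
info.Set("romio_cb_write", "enable")     # force collective buffering for writes

# Open a shared file with the chosen hints and perform a collective write.
fh = MPI.File.Open(comm, "checkpoint.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
block = np.full(1 << 20, comm.Get_rank(), dtype=np.uint8)  # 1 MiB per rank
fh.Write_at_all(comm.Get_rank() * block.nbytes, block)
fh.Close()
info.Free()
```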
The I/O performance model estimates I/O performance based on the results of previous runs. It achieves improvements in write performance for important I/O benchmarks and a real application on the supercomputer Vulcan. Thereby, the training time needed to find the best parameters is drastically reduced from several hours (application-dependent) for a naive strategy to only 8.0 seconds (data-dependent) on average, an enormous improvement over past auto-tuning models. An increase in I/O bandwidth of up to a factor of 18 over the default parameters is achieved for collective I/O in IOR, and of up to a factor of 5 for non-contiguous I/O in MPI-Tile-IO. Finally, checkpoint writing time in ls1 Mardyn is improved by a factor of up to 32 over the default parameters. This demonstrates that the proposed approach can indeed be useful for I/O tuning of parallel applications in HPC. The I/O model can be trained with negligible effort for any benchmark or I/O application. It uses random forest regression and obtains median prediction errors of less than 10% in most cases, even with an 80%-20% train/test split, where results are averaged over 10 different train/test splits. The approach can be understood by users with little knowledge of parallel I/O and requires no post-processing step. It is implemented on top of the MPI-IO library so that it is compatible with MPI-based engineering applications and portable to different HPC platforms. The parameters discussed in this paper are system-dependent, but new parameters can easily be integrated into the configuration files. Future efforts will explore feeding the model with more input data, so that it can learn comprehensively across various applications, as well as more accurate representations of the configuration parameters and statistical methods.
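The evaluation protocol described above (random forest regression, an 80%-20% train/test split, results averaged over 10 different splits, median relative prediction error) can be sketched as follows with scikit-learn. The feature encoding and file names are assumptions made for illustration; the actual feature set and training data of the paper are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical history of I/O runs: each row encodes an I/O pattern and a
# tried configuration (e.g. process count, transfer size, stripe count,
# stripe size, cb_nodes); the target is the measured write bandwidth.
X = np.loadtxt("io_history_features.csv", delimiter=",")   # assumed file name
y = np.loadtxt("io_history_bandwidth.csv", delimiter=",")  # assumed file name

median_errors = []
for seed in range(10):  # average results over 10 different train/test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)            # 80%-20% split
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rel_err = np.abs(y_pred - y_test) / np.abs(y_test)     # relative error
    median_errors.append(np.median(rel_err))

print("median relative prediction error per split:", median_errors)
print("mean of medians: {:.1%}".format(np.mean(median_errors)))
```

At prediction time, such a model can be queried with the candidate configurations for a new I/O pattern and the configuration with the highest predicted bandwidth selected, which avoids rerunning the application for every candidate.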