Analytic, first-principles performance modeling of distributed-memory applications is notoriously imprecise due to a wide spectrum of random disturbances in the application and the system. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time does often not give a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern. A comprehensive quantitative understanding of such effects is not available. In my PhD work I aim at developing a comprehensive theory about how disturbances from different sources travel and interact through a parallel application on a cluster, causing collective phenomena that break the inherent symmetry of the underlying software and hardware.
I have developed a validated analytic model for the propagation speed of “idle waves,” i.e., propagating phases of inactivity across the MPI processes emanating from injected delays on individual processes. I also study how such delays interact with each other and with fine-grained noise. There are two mechanisms of idle wave decay: topological decay due to differences in communication characteristics among parts of the system, and noise-induced decay due to system or application noise. I could show that the noise-induced decay rate of idle waves only depends on the noise power and not on the statistical properties of the noise. Although it may seem that MPI collective functions put an end to a traveling idle wave, I could show that some implementation variants of collectives are permeable to the wave. These results contribute to a better understanding of the collective, nonlinear phenomena connected to idle waves.
Another focus of my research is provoked and spontaneous desynchronization of bottleneck-bound, bulk-synchronous programs. In presence of a bottleneck (e.g., memory bandwidth), even programs with no load imbalance and a perfect translational symmetry across processes end up in a desynchronized state, which we call “computational wavefront.” Although this can cause increased waiting time per process, it can also boost the available bandwidth per core and provide automatic communication overlap, improving the overall time to solution. The number of processes per memory domain required to achieve full memory bandwidth is important for the emerging stable wave pattern.
In a desynchronized state, different cores can execute different loops, which is also true for task-based programming models. I have devised a performance model for overlapping execution of memory-bound loop kernels that can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and which other workload the kernel is paired with. The only code features required are the single-thread cache line access frequency per kernel, which is directly related to the single-thread memory bandwidth, and its saturated bandwidth.
In ongoing work, I develop a parallel program simulator for large-scale applications that takes the socket-level properties of the application into account. It serves as a tool to explore these dynamics further in a controlled environment. I am also currently investigating how a parallel program can be characterized by a chain of coupled oscillators via a modified Kuramoto model.