Non-Volatile Memory for Exact-State-Reconstruction of Preconditioned Conjugate Gradient in Supercomputers

Wednesday, June 1, 2022 1:04 PM to 1:08 PM · 4 min. (Europe/Berlin)
Hall D - 2nd Floor
Exascale Systems

Information

HPC systems are a critical resource for scientific research and advanced industries. Growing demand for computational power and memory is ushering in the exascale era, in which supercomputers are designed to deliver the enormous computing power needed to meet that demand. These complex machines consist of many compute nodes and are consequently expected to experience frequent faults and crashes. Exact state reconstruction (ESR) has been proposed as a mechanism to reduce the impact of frequent failures on long-running computations. ESR has shown great potential for iterative linear algebra solvers, a key building block in numerous scientific applications.

Recent supercomputer designs feature emerging non-volatile memory (NVM) technology. For example, the exascale Aurora supercomputer is planned to integrate Intel Optane DCPMM. This work investigates how NVM can be used to improve ESR so that it scales to future exascale systems such as Aurora and provides enhanced resilience.

We propose the non-volatile memory ESR (NVM-ESR) mechanism. NVM-ESR demonstrates how NVM can be used in supercomputers to enable efficient recovery from faults while requiring a significantly smaller memory footprint and lower time overhead than ESR. We focus on the preconditioned conjugate gradient (PCG) iterative solver, which was also studied in prior ESR research, because it is employed by the representative HPCG scientific benchmark.
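
For readers unfamiliar with PCG, the following is a minimal textbook sketch of the solver, not the authors' implementation and not the HPCG code. It assumes a small dense symmetric positive definite matrix stored as a list of lists and a Jacobi (diagonal) preconditioner; the names `matvec`, `dot`, and `pcg` are hypothetical and chosen only for illustration. The vectors `x`, `r`, `z`, and `p` are the per-iteration solver state that would be lost if a node failed mid-run.

```python
def matvec(A, v):
    """Dense matrix-vector product A @ v for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Assumption for illustration: Jacobi preconditioner M = diag(A).
    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # applies M^-1
    x = [0.0] * n                                  # initial guess
    r = list(b)                                    # residual b - A x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)                                    # first search direction
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:                # converged on residual norm
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                         # direction update coefficient
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    x, iters = pcg(A, b)
    print(x, iters)
```

Production solvers such as the one in HPCG operate on distributed sparse matrices; the dense version here only shows the order of the updates within one iteration.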
Contributors:

  • Yehonatan Fridman (Department of Computer Science, Ben-Gurion University of the Negev)
  • Hagit Attiya (Department of Computer Science, Technion – Israel Institute of Technology)
  • Danny Hendler (Department of Computer Science, Ben-Gurion University of the Negev)
  • Harel Levin (Scientific Computing Center, Nuclear Research Center – Negev)
  • Gal Oren (Department of Computer Science, Technion – Israel Institute of Technology)
  • Yaniv Snir (Department of Computer Science, Ben-Gurion University of the Negev)
Format
On-site