BEST POSTER FINALIST: MPI4Dask: Efficient MPI-Based Communication for Scalable Accelerated Dask Applications
Monday, May 22, 2023 3:00 PM to Wednesday, May 24, 2023 5:00 PM · 2 days 2 hr. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
Big Data Analytics, Managing Extreme-Scale Parallelism, ML Systems and Tools, Numerical Libraries
Information
The Message Passing Interface (MPI) is widely used to harness the computing power of High-Performance Computing (HPC) platforms, especially in the compute-intensive back ends of data science. Dask, one of the most popular Big Data frameworks in data science, requires the scalability and high throughput that state-of-the-art MPI libraries deliver in heterogeneous scenarios. To this end, we propose MPI4Dask, the first set of MPI-based, high-performance point-to-point communication coroutines for Python-based HPC applications. MPI4Dask implements MPI-based Dask communication within the asynchronous Distributed library for both CPU and GPU scenarios, outperforming UCX by about 5x in both latency and throughput. With the latest MVAPICH2 2.3.7 library, we further demonstrate the performance benefits of MPI4Dask over TCP and UCX in various commonly used data science applications, where our framework runs 1.3x faster than UCX on the TACC Frontera CPU cluster and 3.4x faster on the Cambridge Wilkes-3 GPU cluster. In the scalability evaluation, we scale out to hundreds of Dask workers while sustaining 2.0x higher throughput than UCX.
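As a rough illustration of what a point-to-point communication coroutine can look like, the sketch below polls a nonblocking request from inside an asyncio coroutine so that the event loop stays free for other work. It is not taken from MPI4Dask: `FakeRequest`, `wait_for`, and `poll_interval` are hypothetical names, and the stand-in class only imitates the shape of a nonblocking MPI request (a `Test()` call that returns True once the operation has completed), so the example runs without an MPI installation.

```python
import asyncio


class FakeRequest:
    """Hypothetical stand-in for a nonblocking MPI request (not a real MPI object)."""

    def __init__(self, polls_until_done, payload):
        self._remaining = polls_until_done
        self._payload = payload

    def Test(self):
        # Assumed interface: returns True once the operation has completed.
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True

    def result(self):
        return self._payload


async def wait_for(request, poll_interval=0.001):
    """Poll the request, yielding to the event loop between checks."""
    while not request.Test():
        await asyncio.sleep(poll_interval)
    return request.result()


async def main():
    # Two pending "receives" make progress concurrently on one event loop.
    fast = FakeRequest(polls_until_done=1, payload="chunk-a")
    slow = FakeRequest(polls_until_done=3, payload="chunk-b")
    return await asyncio.gather(wait_for(fast), wait_for(slow))


results = asyncio.run(main())
```

Here `asyncio.gather` returns the payloads in the order the coroutines were passed in, regardless of which request completes first. A real implementation would replace `FakeRequest` with requests returned by nonblocking send and receive calls of an MPI binding.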
Contributors:
Format
On-site
Beginner Level
60%
Intermediate Level
30%
Advanced Level
10%