

Enabling vLLM on ARM for Scalable LLM Inference on Resource-Constrained Servers
Tuesday, June 10, 2025 3:00 PM to Thursday, June 12, 2025 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Project Poster
HPC in the Cloud and HPC Containers · HW and SW Design for Scalable Machine Learning · Large Language Models and Generative AI in HPC · Optimizing for Energy and Performance · Sustainability and Energy Efficiency
Information
Poster is on display.
vLLM is a serving framework built for deploying large language models (LLMs) memory-efficiently. Its continuous batching and paged attention reduce the memory footprint while striking the right balance between request-level latency and throughput. Originally, vLLM targeted GPUs, where it demonstrated 2-4x better throughput than other serving methods under fixed latency requirements. Support for x86 CPU platforms was enabled recently.
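To give a flavor of the paged-attention idea mentioned above, the following is a minimal sketch of a paged KV cache: the logical KV cache of a sequence is split into fixed-size blocks, and a per-sequence block table maps logical block indices to physical blocks, so memory is allocated on demand rather than reserved for the maximum sequence length. The block size, struct names, and helpers here are illustrative assumptions, not vLLM's actual data structures.

```c
#include <stdint.h>

#define BLOCK_SIZE 16   /* tokens per KV-cache block (illustrative) */

typedef struct {
    int32_t *block_table;   /* logical block index -> physical block id */
    int32_t  num_blocks;    /* blocks currently allocated to this sequence */
} SequenceBlockTable;

/* Translate a token position into the physical block that holds its KV entries. */
static inline int32_t physical_block(const SequenceBlockTable *t, int32_t pos) {
    return t->block_table[pos / BLOCK_SIZE];
}

/* Offset of the token within its block. */
static inline int32_t block_offset(int32_t pos) {
    return pos % BLOCK_SIZE;
}
```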
We enable vLLM for ARM CPUs, extending support for the PyTorch and OpenVINO backends. We develop specialized kernels using Neon and Scalable Vector Extension (SVE) intrinsics for SIMD vectorization on ARM. Over the baseline, we observe a 1.5x overall latency improvement with PyTorch, and with OpenVINO a ~51x improvement in prefill latency and a ~3x improvement in per-token decoding latency. A small sketch of the style of SVE vectorization follows.
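As a rough illustration of SVE-based SIMD vectorization (not the poster's actual kernels), the sketch below computes a dot product with predicated, vector-length-agnostic loops of the kind used in attention and GEMV inner loops. The function name and its role are assumptions for illustration only.

```c
#include <arm_sve.h>
#include <stddef.h>

/* Hypothetical sketch: predicated SVE dot product over float32 arrays. */
float dot_product_sve(const float *a, const float *b, size_t n) {
    svfloat32_t acc = svdup_f32(0.0f);          /* vector accumulator */
    size_t i = 0;
    svbool_t pg = svwhilelt_b32(i, n);          /* predicate for remaining lanes */
    while (svptest_any(svptrue_b32(), pg)) {
        svfloat32_t va = svld1_f32(pg, a + i);  /* predicated load of a */
        svfloat32_t vb = svld1_f32(pg, b + i);  /* predicated load of b */
        acc = svmla_f32_m(pg, acc, va, vb);     /* acc += a * b on active lanes */
        i += svcntw();                          /* advance by hardware vector length */
        pg = svwhilelt_b32(i, n);
    }
    return svaddv_f32(svptrue_b32(), acc);      /* horizontal reduction */
}
```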
Contributors:
Format
On Demand · On Site

