Enabling vLLM on ARM for Scalable LLM Inference on Resource-Constrained Servers

Tuesday, June 10, 2025 3:00 PM to Thursday, June 12, 2025 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Project Poster
HPC in the Cloud and HPC Containers · HW and SW Design for Scalable Machine Learning · Large Language Models and Generative AI in HPC · Optimizing for Energy and Performance · Sustainability and Energy Efficiency

Information

Poster is on display.
vLLM is a serving framework built for deploying large language models (LLMs) with high memory efficiency. It provides continuous batching and paged attention, enabling a lower memory footprint and a good balance between request-level latency and throughput. vLLM was originally supported on GPUs, where it demonstrated 2-4x higher throughput than other serving methods under fixed latency requirements; support for x86 platforms was added recently.
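
The paged-attention idea can be summarized with a small, hypothetical C++ sketch: the KV cache is split into fixed-size blocks drawn from a shared pool, and a per-sequence block table maps logical blocks to physical ones, so a sequence's cache need not be contiguous and unused slots are not reserved up front. The block size, struct, and function names below are illustrative, not vLLM's actual code.

#include <cstdint>
#include <utility>
#include <vector>

// Illustrative only: tokens per KV-cache block (hypothetical value).
constexpr int32_t kBlockSize = 16;

struct BlockTable {
    // logical block index -> physical block id in the shared KV-cache pool
    std::vector<int32_t> physical_blocks;
};

// Translate a token position within a sequence into (physical block, offset),
// the lookup an attention kernel performs when reading cached keys/values.
inline std::pair<int32_t, int32_t> Locate(const BlockTable& table, int32_t token_pos) {
    const int32_t logical_block = token_pos / kBlockSize;
    const int32_t offset = token_pos % kBlockSize;
    return {table.physical_blocks[logical_block], offset};
}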

We enable vLLM for ARM CPUs, extending support to the PyTorch and OpenVINO backends. We develop specialized kernels using Neon and Scalable Vector Extension (SVE) intrinsics for SIMD vectorization on ARM. Over the baseline, we observe a 1.5x overall latency improvement with PyTorch, and, with OpenVINO, a ~51x improvement in prefill latency and a ~3x improvement in per-token decoding latency.
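
As a rough illustration of what Neon and SVE vectorization look like at the intrinsics level, the hypothetical C++ sketch below shows a vectorized dot product, a typical building block of attention kernels. It is not the actual vLLM kernel code; function names and loop structure are ours, and the SVE variant assumes an SVE-capable AArch64 target.

#include <arm_neon.h>
#include <cstddef>

// Neon: fixed 128-bit vectors, 4 float lanes per iteration (AArch64).
float dot_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);      // acc += va * vb (fused multiply-add)
    }
    float sum = vaddvq_f32(acc);           // horizontal reduction across lanes
    for (; i < n; ++i) sum += a[i] * b[i]; // scalar tail
    return sum;
}

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// SVE: vector length is decided by the hardware; predication handles the tail.
float dot_sve(const float* a, const float* b, size_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb); // acc += va * vb on active lanes
    }
    return svaddv_f32(svptrue_b32(), acc);  // horizontal reduction
}
#endif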
Contributors:
Format
On Demand, On Site
