

Accelerating Quantized Vision Inference on ARM Through Optimized Pointwise and Depthwise Convolutions
Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Project Poster
AI Applications powered by HPC Technologies · HW and SW Design for Scalable Machine Learning · ML Systems and Frameworks · Optimizing for Energy and Performance · Parallel Numerical Algorithms
Information
Poster is on display.
As deep learning (DL) workloads are increasingly deployed on resource-constrained and energy-efficient systems, quantization has become an important technique for enabling high-performance CPU inference, particularly for vision workloads. In modern efficient convolutional neural networks such as MobileNet, EfficientNet, and ShuffleNet, inference time is dominated by pointwise (1×1) and depthwise convolution operators. Pointwise convolution performs channel mixing using 1×1 kernels, while depthwise convolution applies spatial filtering independently per channel. Together, they form the core building blocks of depthwise separable convolutions. As these operators account for a large fraction of overall execution time, optimizing them is critical for achieving efficient quantized inference. However, existing CPU backends on ARM architectures do not provide highly optimized INT8 kernels for these convolution types.
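The two operators above can be illustrated with a minimal NumPy sketch (shapes and names are illustrative, not the actual oneDNN kernels): pointwise convolution is per-pixel channel mixing, i.e. a GEMM over the spatial positions, while depthwise convolution filters each channel in isolation.

```python
import numpy as np

def pointwise_conv(x, w):
    """Pointwise (1x1) convolution: mixes channels at each spatial position.
    x: (H, W, C_in) activations, w: (C_in, C_out) weights.
    Per pixel this is a vector-matrix product, so the whole op is a GEMM
    with H*W rows."""
    return x @ w

def depthwise_conv(x, w):
    """Depthwise convolution: spatial filtering applied independently per
    channel. x: (H, W, C), w: (K, K, C) per-channel kernels (stride 1,
    no padding)."""
    H, W, C = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output channel depends only on the same input channel
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w, axis=(0, 1))
    return out

x = np.random.rand(8, 8, 16)
w_pw = np.random.rand(16, 32)
w_dw = np.random.rand(3, 3, 16)
pw = pointwise_conv(x, w_pw)   # shape (8, 8, 32): channels mixed
dw = depthwise_conv(x, w_dw)   # shape (6, 6, 16): channels untouched
```

In a depthwise separable block these run back-to-back: the depthwise stage handles spatial structure, the pointwise stage handles channel interactions.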
This work presents optimized SVE JIT-based INT8 convolution kernels for pointwise and depthwise convolutions on ARM architectures. Our approach reformulates convolutions as matrix multiplications and executes them using batch-reduce GEMM primitives. Pointwise convolution is mapped to BRGEMM (Batch Reduce General Matrix Multiplication), which accumulates the results of a batch of small GEMM operations into a single output. Depthwise convolution is mapped to BRDGEMM (Batch Reduce Diagonal General Matrix Multiplication), which exploits the diagonal computation structure arising from channel-wise independence. In this formulation, pointwise convolution activations and weights are transformed into dense matrices, whereas for depthwise convolution, activations are dense and weights are represented as diagonal matrices. The resulting computation is expressed as a batch of small matrix multiplications executed by BRGEMM and BRDGEMM kernels.
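The batch-reduce formulation can be sketched in NumPy (a reference-semantics sketch under assumed shapes, not the JIT kernels themselves). BRGEMM sums a batch of small GEMMs into one output; in BRDGEMM each weight matrix is diagonal, so every small multiply collapses to per-channel scaling. A K×K depthwise convolution then becomes a batch of K·K diagonal multiplies, one per kernel tap, over shifted activation views:

```python
import numpy as np

def brgemm(A_batch, B_batch):
    """Batch-reduce GEMM: accumulate a batch of small GEMMs into one output.
    A_batch: (B, M, K), B_batch: (B, K, N) -> C: (M, N)."""
    return np.einsum('bmk,bkn->mn', A_batch, B_batch)

def brdgemm(A_batch, d_batch):
    """Batch-reduce diagonal GEMM: each weight matrix is diag(d), so the
    small multiply is per-column scaling. A_batch: (B, M, C), d_batch:
    (B, C) -> C: (M, C)."""
    return np.einsum('bmc,bc->mc', A_batch, d_batch)

# Depthwise conv as BRDGEMM: one diagonal multiply per kernel tap.
H, W, C, K = 6, 6, 4, 3
x = np.random.rand(H, W, C)
w = np.random.rand(K, K, C)
Ho, Wo = H - K + 1, W - K + 1

# Batch of K*K shifted activation views, each flattened to (Ho*Wo, C) rows.
A = np.stack([x[i:i + Ho, j:j + Wo].reshape(-1, C)
              for i in range(K) for j in range(K)])
d = w.reshape(K * K, C)  # the per-channel weights = one diagonal per tap
out = brdgemm(A, d).reshape(Ho, Wo, C)
```

The batched sum over taps is exactly the reduction a batch-reduce kernel performs in its accumulator registers, which is why a single JIT-generated inner kernel can serve many convolution shapes.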
To efficiently realize this formulation on ARM, we design architecture-specific JIT kernels using the Scalable Vector Extension (SVE). JIT code generation enables specialization for convolution parameters, efficient vectorization, and low-overhead execution, which are particularly important for small matrix workloads. We introduce INT8 support by designing data layouts, accumulation methods, and vectorized compute pipelines specifically optimized for INT8 execution on ARM SVE. The kernels leverage ARM INT8 dot-product instructions to accumulate results in INT32. The accumulators are then converted to FP32 to apply quantization parameters (scales and zero-points) and post-operations (bias, sum, element-wise operations, etc.), followed by conversion to the required destination data type.
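The numeric pipeline described above can be sketched as follows (a minimal NumPy model assuming symmetric quantization with zero-point 0 and a bias post-op for brevity; the INT32 accumulation plays the role of the SVE INT8 dot-product instructions):

```python
import numpy as np

def int8_gemm_requant(a_s8, b_s8, a_scale, b_scale, dst_scale, bias):
    """Sketch of the INT8 compute pipeline: INT8 x INT8 products accumulate
    in INT32, the accumulator is converted to FP32 to apply quantization
    scales and the bias post-op, then the result is requantized and
    saturated to the INT8 destination type.
    a_s8: (M, K) int8, b_s8: (K, N) int8, bias: (N,) float32."""
    acc_i32 = a_s8.astype(np.int32) @ b_s8.astype(np.int32)     # INT32 accumulation
    acc_f32 = acc_i32.astype(np.float32) * (a_scale * b_scale)  # dequantize to FP32
    acc_f32 += bias                                             # post-op: bias
    q = np.rint(acc_f32 / dst_scale)                            # requantize
    return np.clip(q, -128, 127).astype(np.int8)                # saturate to s8

a = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
b = np.random.randint(-128, 128, size=(8, 5), dtype=np.int8)
y = int8_gemm_requant(a, b, a_scale=0.1, b_scale=0.05,
                      dst_scale=0.2, bias=np.zeros(5, np.float32))
```

Keeping the reduction in INT32 and deferring the FP32 conversion to a single post-processing step is what lets the inner loop stay on the wide INT8 dot-product path.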
The proposed kernels are integrated into the oneDNN library and evaluated on ARM-based systems with SVE support. Experimental results show up to 3× speedup over FP32 inference and up to 7× speedup over baseline INT8 implementations in PyTorch, leading to significant end-to-end performance improvements for quantized vision models. Overall, this work accelerates deep learning vision workloads on ARM-based HPC systems by delivering scalable, energy-efficient INT8 inference and improved utilization of vector units for high-throughput CPU execution.
Format
on-demand · on-site

