vLLM Release 26.03

The NVIDIA vLLM Release 26.03 is made up of two container images available on NGC: vLLM.

Contents of the vLLM container

This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
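As a quick sanity check of the layout described above, the following minimal sketch reports where Python would import vLLM from and whether the source tree is present. The helper name package_dir is hypothetical and used only for illustration; the paths are the ones stated in this section.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the directory a top-level package would be imported from.

    Returns None when the package cannot be found. The package itself is
    not imported, so this is cheap even for large libraries.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    # Expected inside the container, per the text above:
    # /usr/local/lib/python3.12/dist-packages/vllm
    print("vllm importable from:", package_dir("vllm"))
    print("source tree at /opt/vllm:", Path("/opt/vllm").is_dir())
```

Outside the container, the first line will typically print None and the second False.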

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:

Please refer to the CUDA section for the list of libraries inherited from the CUDA container.
vLLM 0.17.1
flashinfer 0.6.7
transformers 4.57.5
flash-attention 2.7.4.post1
xgrammar 0.1.32
2.11.0a0+a6c236b9fd1
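One possible way to compare a running container against the versions listed above is sketched here. The dictionary keys are assumed pip distribution names (for example flashinfer-python and flash-attn); they are not stated in these release notes and may differ from how the packages are registered in the image. The helper names are hypothetical.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions copied from the list above. The keys are ASSUMED distribution
# names, not names confirmed by the release notes.
EXPECTED: Dict[str, str] = {
    "vllm": "0.17.1",
    "flashinfer-python": "0.6.7",
    "transformers": "4.57.5",
    "flash-attn": "2.7.4.post1",
    "xgrammar": "0.1.32",
}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def compare(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each distribution to (expected, installed, matches).

    A local version suffix (anything after '+') is ignored when matching,
    since container builds may append one.
    """
    report = {}
    for dist, want in expected.items():
        have = installed_version(dist)
        matches = have is not None and have.split("+", 1)[0] == want
        report[dist] = (want, have, matches)
    return report


if __name__ == "__main__":
    for dist, (want, have, ok) in compare(EXPECTED).items():
        print(f"{dist}: expected {want}, installed {have}, match={ok}")
```

A distribution reported as None may simply be registered under a different name than the one assumed here.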

Driver Requirements

Release 26.03 is based on CUDA 13.2.0. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:

NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
NVIDIA Drivers Download - Latest NVIDIA drivers
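Before consulting the compatibility documents above, it helps to know which driver version the host reports. The sketch below reads it through nvidia-smi, assuming that tool is on the PATH; the function names are hypothetical. It deliberately does not encode a minimum version, since the linked documents are the authority on that.

```python
import shutil
import subprocess
from typing import List


def parse_driver_versions(output: str) -> List[str]:
    """Split nvidia-smi CSV output (one line per GPU) into version strings."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def driver_versions() -> List[str]:
    """Return the driver version reported for each GPU.

    Returns an empty list when nvidia-smi is missing or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_driver_versions(result.stdout)


if __name__ == "__main__":
    versions = driver_versions()
    print("driver versions:", versions if versions else "not available")
```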

Key Features and Enhancements

This vLLM release includes the following key features and enhancements.

Support for Nemotron Super V3

Announcements

None.

Known Issues

vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization≈1.0). On systems with shared or unified GPU memory (e.g., DGX Spark or Jetson platforms), this can lead to out-of-memory (OOM) errors. If you encounter an OOM error, start vllm serve with a lower utilization value, for example: vllm serve <model> --gpu-memory-utilization 0.7.
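When the server is launched from a script, the workaround above can be applied programmatically. This is a minimal sketch: serve_command is a hypothetical helper, the model name in the usage section is a placeholder, and the range check is a sanity check added for this example only.

```python
import shlex
from typing import List


def serve_command(model: str, gpu_memory_utilization: float = 0.7) -> List[str]:
    """Build the argument list for `vllm serve` with a capped memory fraction.

    The 0.7 default mirrors the example value in the known issue above.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")
    return [
        "vllm",
        "serve",
        model,
        "--gpu-memory-utilization",
        str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model identifier.
    cmd = serve_command("my-org/my-model", 0.7)
    print(shlex.join(cmd))
    # To actually start the server: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string lets the command be passed to subprocess.run without shell quoting concerns.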