vLLM Release 26.06

The NVIDIA vLLM Release 26.06 is made up of two container images available on NGC: vLLM.

Contents of the vLLM container

This container image contains the complete source of the included vLLM version in /opt/vllm. vLLM is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm). Visit vLLM Docs to learn more about vLLM.

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:

Please refer to the CUDA section for the list of libraries inherited from the CUDA container.
vLLM 0.22.1
flashinfer 0.6.12
transformers 5.6.0
flash-attention 2.7.4.post1
xgrammar 0.2.0
Torch 2.13
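
To confirm that a running container matches the versions listed above, a check along the following lines may help. This is a minimal sketch, not an official tool: the distribution names used as dictionary keys (for example flashinfer-python and flash-attn) are assumptions about how each package is registered with pip and may differ in the container, and check_versions is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the list in these release notes. The distribution
# names (keys) are assumptions and may need adjusting for the container.
EXPECTED = {
    "vllm": "0.22.1",
    "flashinfer-python": "0.6.12",
    "transformers": "5.6.0",
    "flash-attn": "2.7.4.post1",
    "xgrammar": "0.2.0",
    "torch": "2.13",
}


def check_versions(expected=EXPECTED, lookup=version):
    """Return {name: (installed_version_or_None, matches)} per package."""
    report = {}
    for name, want in expected.items():
        try:
            have = lookup(name)
        except PackageNotFoundError:
            have = None
        # Simple prefix match, because container builds may append
        # local suffixes to the version string.
        report[name] = (have, have is not None and have.startswith(want))
    return report


if __name__ == "__main__":
    for name, (have, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH" if have else "not installed"
        print(f"{name:20s} {have or '-':24s} {status}")
```

Run inside the container, the script prints one line per package. A package reported as "not installed" may simply be registered under a different distribution name than the one assumed here.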

Driver Requirements

Release 26.06 is based on CUDA 13.3.0. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:

NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
NVIDIA Drivers Download - Latest NVIDIA drivers

Key Features and Enhancements

This vLLM release includes the following key features and enhancements.

Support Nemotron Super V3
Support Nemotron 3 Nano Omni
Support DeepSeek V4

Announcements

None.

Known Issues

vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization≈1.0). On systems with shared/unified GPU memory (e.g. DGX Spark or Jetson platforms), this can lead to out-of-memory (OOM) errors. If you encounter an OOM error, start vllm serve with a lower utilization value, for example: vllm serve <model> --gpu-memory-utilization 0.7.
When running Nemotron Nano V3 or Nemotron Super V3 NVFP4 models on DGX Spark, the number of sequences must be limited to 4:
- vllm serve <model> --max-num-seqs 4
The 26.06 vLLM container release includes two known vulnerabilities (CVEs). See the details below:
- CVE-2026-6100
  - The use-after-free is triggered only if an lzma / bz2 / gzip decompressor instance is reused after a MemoryError under memory pressure. The AWS CLI bundled in the container uses one-shot decompression helpers and is not on the container's training / inference path, so the vulnerable condition is not reached.
- CVE-2026-7210
  - The vulnerable Python is used only by the AWS CLI bundled in the container as an operator-invoked tool. It is not exposed as a network service, and the container's training / inference workloads do not invoke it. Customers using the AWS CLI should point it only at trusted AWS endpoints and avoid feeding it XML from untrusted sources.
Serving gpt-oss-120b on DGX Spark or Jetson platforms results in an out-of-memory error. Use the 26.05 container for this model instead.
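
The two memory-related workarounds above (a lower --gpu-memory-utilization on shared/unified-memory systems, and --max-num-seqs 4 for the Nemotron NVFP4 models) can be combined when assembling the launch command. The following is a minimal sketch under stated assumptions: build_serve_command and its parameters are hypothetical names introduced for illustration, the model name is a placeholder, and only the two flags quoted in the known issues are used.

```python
import shlex


def build_serve_command(model, unified_memory=False, limit_sequences=False,
                        gpu_memory_utilization=0.7, max_num_seqs=4):
    """Build a `vllm serve` argument list with the known-issue workarounds.

    unified_memory:  set on shared/unified-memory systems to lower the
                     GPU memory utilization fraction.
    limit_sequences: set for the affected NVFP4 models to cap the number
                     of sequences.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    cmd = ["vllm", "serve", model]
    if unified_memory:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if limit_sequences:
        cmd += ["--max-num-seqs", str(max_num_seqs)]
    return cmd


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model identifier.
    cmd = build_serve_command("my-org/my-model",
                              unified_memory=True, limit_sequences=True)
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems. The 0.7 default mirrors the example value in the known issue and may need tuning for a given model and platform.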