vLLM Release 26.06
The NVIDIA vLLM Release 26.06 is made up of two container images available on NGC: vLLM.
Contents of the vLLM container
This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:
- Please refer to the CUDA section for the list of libraries inherited from the CUDA container.
- vLLM 0.22.1
- flashinfer 0.6.12
- transformers 5.6.0
- flash-attention 2.7.4.post1
- xgrammar 0.2.0
- Torch 2.13
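One way to confirm that a running container matches the versions listed above is to query the installed package metadata from Python. The following is a minimal sketch, not an official verification tool. The distribution names used as dictionary keys are assumptions, and flashinfer and flash-attention are left out because the names they register under inside the container are less certain.

```python
from importlib import metadata

# Expected versions copied from the list above. The distribution names are
# assumptions: the names registered inside the container may differ.
EXPECTED = {
    "vllm": "0.22.1",
    "transformers": "5.6.0",
    "xgrammar": "0.2.0",
    "torch": "2.13",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each distribution name to a tuple (expected, found, matches)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        # Simple prefix check, so a build suffix after the listed version
        # is still accepted. It is not a full version comparison.
        matches = found is not None and found.startswith(want)
        report[name] = (want, found, matches)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in compare_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}, match={ok}")
```

A missing distribution is reported as found=None rather than raising, so the script can also be run outside the container for comparison.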
Driver Requirements
Release 26.06 is based on CUDA 13.3.0. For comprehensive and up-to-date driver compatibility information, refer to the following documentation:
- NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
- CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
- NVIDIA Drivers Download - Latest NVIDIA drivers
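To look up a system in the compatibility documents listed above, you first need the installed driver version. The sketch below reads it from nvidia-smi and assumes that nvidia-smi is on the PATH of the host or container. It only reports the version. It does not check it against any minimum, because these release notes do not state one.

```python
import subprocess


def parse_driver_versions(smi_output):
    """Turn nvidia-smi CSV output (one line per GPU) into a sorted list of unique versions."""
    return sorted({line.strip() for line in smi_output.splitlines() if line.strip()})


def query_driver_versions():
    """Ask nvidia-smi for the driver version reported by each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_driver_versions(result.stdout)


if __name__ == "__main__":
    try:
        print("Installed driver version(s):", ", ".join(query_driver_versions()))
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # nvidia-smi is missing or failed, for example on a machine without a GPU.
        print("Could not query nvidia-smi:", exc)
```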
Key Features and Enhancements
This vLLM release includes the following key features and enhancements.
- Support Nemotron Super V3
- Support Nemotron 3 Nano Omni
- Support DeepSeek V4
Announcements
- None.
Known Issues
- vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization ≈ 1.0). On systems with shared or unified GPU memory (for example, DGX Spark or Jetson platforms), this can lead to out-of-memory (OOM) errors. If you encounter an OOM error, start vllm serve with a lower utilization value, for example: vllm serve <model> --gpu-memory-utilization 0.7
- When running Nemotron Nano V3 or Nemotron Super V3 NVFP4 models on DGX Spark, the number of sequences must be limited to 4:
- vllm serve <model> --max-num-seqs 4
- The 26.06 vLLM container release includes 2 vulnerabilities (CVEs). See the details below:
- CVE-2026-6100
- The use-after-free only triggers if a lzma / bz2 / gzip decompressor instance is re-used after a MemoryError under memory pressure. The AWS CLI bundled in the container uses one-shot decompression helpers and is not on the container's training / inference path, so the vulnerable condition is not reached.
- CVE-2026-7210
- The vulnerable Python is only used by the AWS CLI bundled in the container as an operator-invoked tool — it is not exposed as a network service and the container's training / inference workloads do not invoke it. Customers using AWS CLI should only point it at trusted AWS endpoints and avoid feeding it XML from untrusted sources.
- Serving gpt-oss-120b on DGX Spark or Jetson platforms results in an Out of Memory error. Use the 26.05 container instead.
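The two vllm serve workarounds listed under Known Issues can be combined in a single launch command. The helper below is a minimal sketch that assembles such a command from the --gpu-memory-utilization and --max-num-seqs flags named in the notes. The function name build_serve_command and the model name in the example are hypothetical, and the values 0.7 and 4 are simply the ones quoted above, not tuned recommendations.

```python
import shlex


def build_serve_command(model, gpu_memory_utilization=None, max_num_seqs=None):
    """Build an argv list for `vllm serve`, adding only the flags that were requested."""
    if gpu_memory_utilization is not None and not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")
    if max_num_seqs is not None and max_num_seqs < 1:
        raise ValueError("max_num_seqs must be a positive integer")

    argv = ["vllm", "serve", model]
    if gpu_memory_utilization is not None:
        argv += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if max_num_seqs is not None:
        argv += ["--max-num-seqs", str(max_num_seqs)]
    return argv


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a model named in these notes.
    command = build_serve_command("my-org/my-model", gpu_memory_utilization=0.7, max_num_seqs=4)
    print(shlex.join(command))
```

Returning a list instead of a single string lets the result be passed directly to subprocess.run without shell quoting concerns.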