Date: 2026-03-05
vLLM Release 26.02
The NVIDIA vLLM Release 26.02 is available on NGC as the vLLM container image.
Contents of the vLLM container
This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:
- Please refer to the CUDA section for the list of libraries inherited from the CUDA container.
- vLLM 0.15.1
- flashinfer 0.6.1
- transformers 4.57.5
- flash-attention 2.7.4.post1
- xgrammar 0.1.27
- torch 2.11.0a0+eb65b36914
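To confirm that a running container matches the versions listed above, a check along the following lines may help. This is a minimal sketch, not part of the release: the distribution names used as keys (for example `flashinfer-python` and `flash-attn`) are assumptions about how the packages are registered in the container's Python environment, and an installed version string may carry a build suffix that an exact comparison would flag as a mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the list in these release notes. The keys are ASSUMED
# distribution names; the container may register packages under other names.
EXPECTED = {
    "vllm": "0.15.1",
    "flashinfer-python": "0.6.1",
    "transformers": "4.57.5",
    "flash-attn": "2.7.4.post1",
    "xgrammar": "0.1.27",
    "torch": "2.11.0a0+eb65b36914",
}


def installed_versions(names):
    """Return {name: installed version, or None if the distribution is not found}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def compare(expected, found):
    """Return {name: (expected, found)} for every entry that differs."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    mismatches = compare(EXPECTED, installed_versions(EXPECTED))
    for name, (want, got) in mismatches.items():
        print(f"{name}: expected {want}, found {got}")
    if not mismatches:
        print("All listed packages match the release notes.")
```

A package reported as `None` may simply be installed under a different distribution name than the one assumed here.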
Driver Requirements
Release 26.02 is based on CUDA 13.1.1. For comprehensive and up-to-date driver compatibility information, refer to the following documentation:
- NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
- CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
- NVIDIA Drivers Download - Latest NVIDIA drivers
Key Features and Enhancements
This vLLM release includes the following key features and enhancements.
- Support for openai/gpt-oss-20b and openai/gpt-oss-120b
- Support for Nemotron-Nano-V2
Announcements
- None.
Known Issues
- vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization≈1.0). On systems with shared/unified GPU memory (e.g. DGX Spark or Jetson platforms), this can lead to out-of-memory (OOM) errors. If you encounter OOM errors, start vllm serve with a lower utilization value, for example:
vllm serve <model> --gpu-memory-utilization 0.7
- On DGX Spark, workloads utilizing FP8 models may fail with CUDA stream capture errors due to illegal synchronization operations in FlashInfer kernels. A fix is available in FlashInfer.
- When running the Nemotron Super V3 model with FP8, the Triton backend must be used for attention, for example:
vllm serve <model> --attention-backend triton_attn
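When the server is launched from a script, the workarounds in the known issues above can be applied by building the command line programmatically. The following is a minimal sketch under stated assumptions: `build_serve_argv` is a hypothetical helper (not part of vLLM), the range check on the utilization value is an assumption about what counts as a sensible fraction, and only the two flags quoted in these notes are used.

```python
import shlex


def build_serve_argv(model, gpu_memory_utilization=None, attention_backend=None):
    """Build an argv list for `vllm serve` using the flags quoted in the notes.

    Hypothetical helper for illustration; it only assembles arguments and
    does not start the server.
    """
    argv = ["vllm", "serve", model]
    if gpu_memory_utilization is not None:
        # Assumption: the value is a fraction of GPU memory in (0, 1].
        if not 0.0 < gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        argv += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if attention_backend is not None:
        argv += ["--attention-backend", attention_backend]
    return argv


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model name.
    argv = build_serve_argv("my-org/my-model", gpu_memory_utilization=0.7)
    # Print the command for inspection; pass argv to subprocess.run to launch it.
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps each argument separate, so the result can be handed to `subprocess.run` without shell quoting concerns.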