vLLM Release 25.12
The NVIDIA vLLM Release 25.12 is made up of the following container image available on NGC: vLLM.
Contents of the vLLM container
This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
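Because vLLM is pre-installed in the system Python environment, you can confirm the installed build directly from Python inside the container. The following is a minimal sanity-check sketch; the reported version and path should match the values listed in these release notes.

```python
# Quick check inside the container: confirm the pre-installed vLLM build
# and where it lives in the system Python environment.
import vllm

print(vllm.__version__)  # expected: 0.11.1 for this release
print(vllm.__file__)     # expected: under /usr/local/lib/python3.12/dist-packages/vllm/
```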
The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:
- Please refer to the CUDA section for the list of libraries inherited from the CUDA container.
- NVIDIA CUDA 13.1.0.36
- vLLM 0.11.1
- flashinfer 0.5.2
- transformers 4.57.1
- flash-attention 2.7.4.post1
- xgrammar 0.1.25
- torch 2.10.0a0+b4e4ee81d3
Driver Requirements
Release 25.12 is based on CUDA 13.1.0. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:
- NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
- CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
- NVIDIA Drivers Download - Latest NVIDIA drivers
Key Features and Enhancements
This vLLM release includes the following key features and enhancements.
- Support for openai/gpt-oss-20b and openai/gpt-oss-120b (see the example below)
- Support for Nemotron-Nano-V2
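As an illustration of the newly supported models, the sketch below runs offline inference against openai/gpt-oss-20b with the standard vLLM Python API. It assumes the checkpoint can be downloaded from Hugging Face and that the GPU has enough memory for the model; it is an example, not part of the release contents.

```python
from vllm import LLM, SamplingParams

# Offline-inference sketch for one of the newly supported models.
# Assumes network access to download the checkpoint and sufficient GPU memory.
llm = LLM(model="openai/gpt-oss-20b")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize what vLLM does in one sentence."], params)
print(outputs[0].outputs[0].text)
```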
Announcements
- None.
Known Issues
- vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization ≈ 1.0). On systems with shared/unified GPU memory (e.g. DGX Spark or Jetson platforms), this can lead to out-of-memory errors. If you encounter OOM, start vllm serve with a lower utilization value, for example: vllm serve <model> --gpu-memory-utilization 0.7 (an equivalent Python-API sketch follows below).
- On DGX Spark, workloads utilizing FP8 models may fail with CUDA stream capture errors due to illegal synchronization operations in FlashInfer kernels. A fix is available in FlashInfer.
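For workloads that use the offline LLM API rather than vllm serve, the same memory cap can be applied through the gpu_memory_utilization constructor argument. The snippet below is a sketch of that workaround; the model name is only an illustrative small checkpoint, not one tied to this release.

```python
from vllm import LLM, SamplingParams

# Sketch of the OOM workaround for the offline API: cap the fraction of GPU
# memory vLLM pre-allocates, mirroring --gpu-memory-utilization 0.7 above.
# The model name is illustrative; substitute your own.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.7)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```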