vLLM Release 25.10

The NVIDIA vLLM Release 25.10 is made up of two container images available on NGC: vLLM.

Contents of the vLLM container

This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs, and contains the following software for GPU acceleration

vLLM: 0.10.2
flashinfer 0.4.0
transformers 4.56.1
flash-attention 2.7.4
xgrammer 0.1.24
NVIDIA PyTorch 25.09

Driver Requirements

Release 25.10 is based on CUDA 13.0.2. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:

NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
NVIDIA Drivers Download - Latest NVIDIA drivers

Key Features and Enhancements

This vLLM release includes the following key features and enhancements.

Support for openai/gpt-oss-20b and openai/gpt-oss-120b

Announcements

None.

Known Issues

vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization≈1.0). On systems with shared/unified GPU memory (e.g. DGX Spark or Jetson platforms), this can lead to out-of-memory errors. If you encounter OOM, start vllm serve with a lower utilization value, for example: vllm serve <model> --gpu-memory-utilization 0.7.