
vLLM Release 25.09

The NVIDIA vLLM Release 25.09 is made up of two container images available on NGC: vLLM.

Contents of the vLLM container

This container image includes the complete source of the packaged vLLM version in /opt/vllm. vLLM is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm). Visit vLLM Docs to learn more about vLLM.
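To confirm where a package is installed inside a running container, one option is to ask Python's import machinery. The following is a minimal sketch, not an official NVIDIA tool; the helper name locate_package is hypothetical, and the only path taken from this page is the install location given above.

```python
import importlib.util


def locate_package(name):
    """Return where a top-level module would be imported from, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Packages expose their directories; plain modules only have an origin.
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)
    return [spec.origin]


if __name__ == "__main__":
    # Inside the container, the result is expected to include
    # /usr/local/lib/python3.12/dist-packages/vllm (per the text above).
    print(locate_package("vllm"))
```

Because find_spec only locates the package and does not import it, the check should stay cheap even for a large package such as vLLM.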

The NVIDIA vLLM container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:

vLLM 0.10.1.1
flashinfer 0.4.0
transformers 4.55.2
flash-attention 2.7.4
xgrammar 0.1.22
NVIDIA PyTorch 25.09
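The versions listed above can be compared against what is actually installed. The sketch below uses the standard-library importlib.metadata module. The distribution names used as dictionary keys (for example flashinfer-python and flash-attn) are assumptions based on common PyPI naming and may differ inside the container. A prefix match is used on the further assumption that container builds may append a suffix to the upstream version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the list above. The distribution names (keys) are assumptions
# and may need adjusting to match the names used inside the container.
EXPECTED = {
    "vllm": "0.10.1.1",
    "flashinfer-python": "0.4.0",
    "transformers": "4.55.2",
    "flash-attn": "2.7.4",
    "xgrammar": "0.1.22",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Map each distribution name to a tuple (expected, found, ok)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        # Prefix match: tolerate a build suffix after the upstream version.
        ok = found is not None and found.startswith(want)
        report[name] = (want, found, ok)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {want}, found {found} [{status}]")
```

The lookup parameter exists only so the comparison logic can be exercised without the packages installed.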

Driver Requirements

Release 25.09 is based on CUDA 13.0. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:

NVIDIA CUDA Compatibility Guide - Compatibility information between CUDA versions and driver releases
CUDA Toolkit Release Notes - Driver version requirements and compatibility matrices
NVIDIA Drivers Download - Latest NVIDIA drivers
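Once the minimum driver version has been looked up in the documents above, a host can be checked against it programmatically. The sketch below assumes nvidia-smi is on the PATH and uses its --query-gpu=driver_version option. The minimum version is deliberately left as a command-line argument, because this page does not state one; the helper names are hypothetical.

```python
import shutil
import subprocess
import sys


def parse_version(text):
    """Turn a dotted version such as '1.2.03' into (1, 2, 3) for tuple comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(installed, minimum):
    """True if the installed driver version is not older than the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def query_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    # With several GPUs there is one line per GPU; the driver is the same for all.
    return lines[0].strip() if result.returncode == 0 and lines else None


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: pass the minimum driver version from the CUDA release notes")
    else:
        found = query_driver_version()
        if found is None:
            print("nvidia-smi not available; cannot determine driver version")
        else:
            print(f"driver {found}: meets minimum = {driver_at_least(found, sys.argv[1])}")
```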

Key Features and Enhancements

This vLLM release includes the following key features and enhancements.

Compatibility with CUDA 13.0.
Support for multi-node configurations.
RTX PRO™ 6000 Blackwell Server Edition functional support.
DGX Spark functional support.
Jetson support.
Support for 8-bit floating point (FP8) precision on Hopper GPUs and above.
Support for the NVIDIA 4-bit floating point (NVFP4) format on Blackwell GPUs (including Jetson Thor and DGX Spark), which provides better training and inference performance with lower memory utilization.
Support for DeepSeek-R1 and Llama-3.1-8B-Instruct.
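The precision features above are tied to GPU generation, which can be approximated from the CUDA compute capability. The sketch below is a rough illustration rather than an official support matrix. It assumes Hopper reports compute capability 9.0 and that Blackwell-family devices report a major version of 10 or higher; neither mapping is stated on this page, and actual support also depends on the model and the kernels in use. The function name supported_precisions is hypothetical.

```python
def supported_precisions(major, minor=0):
    """Guess which low-precision formats from the list above apply to a device."""
    precisions = []
    # Assumption: Hopper is compute capability 9.0 ("Hopper GPUs and above").
    if (major, minor) >= (9, 0):
        precisions.append("fp8")
    # Assumption: Blackwell-family devices report major version 10 or higher.
    if major >= 10:
        precisions.append("nvfp4")
    return precisions


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"sm_{major}{minor}: {supported_precisions(major, minor)}")
    else:
        print("No CUDA device visible from PyTorch")
```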

Announcements

25.09 is the first NVIDIA vLLM container release that brings optimizations for NVIDIA GPUs.

Known Issues

None