NVIDIA Optimized Frameworks

vLLM Release 25.09

The NVIDIA vLLM Release 25.09 is available on NGC as the following container image: vLLM.

Contents of the vLLM container

This container image includes the complete vLLM source in /opt/vllm. vLLM is pre-built and installed in the container's default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm). Visit the vLLM Docs to learn more about vLLM.
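To confirm where vLLM is installed, a short check like the one below can be run inside the container. This is a minimal sketch using only the Python standard library; the helper name `locate_package` is hypothetical and not part of vLLM or the container.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def locate_package(name: str) -> Optional[Path]:
    """Return the directory a top-level package would be imported from.

    Returns None when the package cannot be found. The package itself is
    not imported, so this is cheap even for large packages such as vllm.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package dir
    return Path(spec.origin).parent


if __name__ == "__main__":
    location = locate_package("vllm")
    if location is None:
        print("vllm is not importable in this environment")
    else:
        # Per the notes above, the expected location in the container is
        # /usr/local/lib/python3.12/dist-packages/vllm
        print(f"vllm is installed at {location}")
```

Outside the container the script simply reports that vllm is not importable, which makes it safe to run anywhere.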

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains the following software for GPU acceleration:

  • vLLM 0.10.1.1
  • flashinfer 0.4.0
  • transformers 4.55.2
  • flash-attention 2.7.4
  • xgrammar 0.1.22
  • NVIDIA PyTorch 25.09
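The versions listed above can be compared against what is actually installed. The following is a sketch, not an official verification tool: the pip distribution names used as keys (for example `flashinfer-python` and `flash-attn`) are assumptions and may differ in the image, and the function names are hypothetical. NVIDIA PyTorch 25.09 is a container release rather than a pip version, so it is left out.

```python
from importlib import metadata

# Versions taken from the list above. The keys are ASSUMED pip
# distribution names; adjust them if the image uses different ones.
EXPECTED = {
    "vllm": "0.10.1.1",
    "flashinfer-python": "0.4.0",
    "transformers": "4.55.2",
    "flash-attn": "2.7.4",
    "xgrammar": "0.1.22",
}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected):
    """Map each distribution name to (installed_version, matches_expected)."""
    report = {}
    for name, want in expected.items():
        got = installed_version(name)
        # Prefix match, so suffixes such as ".post1" or "+local" still count
        report[name] = (got, got is not None and got.startswith(want))
    return report


if __name__ == "__main__":
    for name, (got, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:20s} expected {EXPECTED[name]:10s} found {got} [{status}]")
```

A prefix match is used because vendor builds often append a local version tag; a strict equality check would flag those as mismatches.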

Driver Requirements

Release 25.09 is based on CUDA 13.0. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:

Key Features and Enhancements

This vLLM release includes the following key features and enhancements.

  • Compatibility with CUDA 13.0.
  • Support for multi-node configurations.
  • RTX PRO™ 6000 Blackwell Server Edition functional support.
  • DGX Spark functional support.
  • Jetson support.
  • Support for 8-bit floating point (FP8) precision on Hopper GPUs and above.
  • Support for NVIDIA's 4-bit floating-point NVFP4 format on Blackwell GPUs (including Jetson Thor and DGX Spark), which improves training and inference performance while lowering memory utilization.
  • Support for DeepSeek-R1 and Llama-3.1-8B-Instruct.
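As an illustration of the FP8 and model support listed above, the sketch below runs offline inference through vLLM's Python API (`LLM`, `SamplingParams`, `generate`). It is not taken from the release notes: the Hugging Face model ID `meta-llama/Llama-3.1-8B-Instruct` is an assumption (and a gated model that requires access), and the helper names `build_engine_kwargs` and `generate_once` are hypothetical. `quantization="fp8"` requests vLLM's FP8 weight quantization and is only expected to help on Hopper GPUs and above.

```python
def build_engine_kwargs(model, use_fp8=False, tensor_parallel_size=1):
    """Collect keyword arguments for vllm.LLM without importing vLLM."""
    kwargs = {"model": model, "tensor_parallel_size": tensor_parallel_size}
    if use_fp8:
        # FP8 quantization; assumed to need Hopper-class GPUs or newer
        kwargs["quantization"] = "fp8"
    return kwargs


def generate_once(prompt, engine_kwargs, max_tokens=64):
    """Load the model and return the completion text for one prompt."""
    # Imported lazily so build_engine_kwargs works without vLLM or a GPU
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    # One prompt in, one RequestOutput out; take its first completion
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # ASSUMED model ID; replace with a model you have access to
    kwargs = build_engine_kwargs("meta-llama/Llama-3.1-8B-Instruct", use_fp8=True)
    print(kwargs)
```

Calling `generate_once("Hello", kwargs)` inside the container would download the weights and load them onto the GPU, so the block stops at printing the arguments.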

Announcements

  • 25.09 is the first NVIDIA vLLM container release that brings optimizations for NVIDIA GPUs.

Known Issues

  • None

© Copyright 2025, NVIDIA. Last updated on Oct 3, 2025.