NVIDIA Optimized Frameworks

vLLM Release 25.12

The NVIDIA vLLM Release 25.12 is made up of two container images available on NGC: vLLM.

Contents of the vLLM container

This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs, and contains the following software for GPU acceleration

  • Please see to the CUDA section for the list of libraries inherited from the CUDA container.
  • NVIDIA CUDA 13.1.0.36
  • vLLM: 0.11.1
  • flashinfer 0.5.2
  • transformers 4.57.1
  • flash-attention 2.7.4.post1
  • xgrammer 0.1.25
  • torch-2.10.0a0+b4e4ee81d3

Driver Requirements

Release 25.12 is based on CUDA 13.1.0. For comprehensive and up-to-date driver compatibility information, please refer to the following documentation:

Key Features and Enhancements

This vLLM release includes the following key features and enhancements.

  • Support for openai/gpt-oss-20b and openai/gpt-oss-120b
  • Support Nemotron-Nano-V2

Announcements

  • None.

Known Issues

  • vllm serve uses aggressive GPU memory allocation by default (effectively --gpu-memory-utilization≈1.0). On systems with shared/unified GPU memory (e.g. DGX Spark or Jetson platforms), this can lead to out-of-memory errors. If you encounter OOM, start vllm serve with a lower utilization value, for example: vllm serve <model> --gpu-memory-utilization 0.7.

  • On DGX Spark, workloads utilizing FP8 models may fail with CUDA stream capture errors due to illegal synchronization operations in FlashInfer kernels. A fix is available in FlashInfer.
© Copyright 2025, NVIDIA. Last updated on Dec 22, 2025