Welcome to NVIDIA Dynamo#

The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.

Dive in: Examples#

  • Hello World Example: Basic Pipeline - Demonstrates the basic concepts of Dynamo by building a simple multi-service pipeline (a sketch of such a pipeline follows this list).

  • LLM Deployment Examples - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.

  • Multinode Examples - Demonstrates disaggregated serving across 3 nodes using nvidia/Llama-3.1-405B-Instruct-FP8.

  • LLM Deployment Examples using TensorRT-LLM - Presents TensorRT-LLM examples and reference implementations for deploying LLMs in various configurations.
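
To give a feel for what the Hello World example builds, here is a minimal, illustrative sketch of a three-stage pipeline in which each service forwards a streamed response to the next. The decorator and module names (`service`, `endpoint`, `depends` from `dynamo.sdk`) are assumptions modeled on the Hello World example and may differ between Dynamo releases; treat the snippet as a sketch of the pattern, not the exact API.

```python
# Illustrative sketch of a Dynamo-style multi-service pipeline.
# NOTE: the dynamo.sdk names below are assumptions based on the Hello World
# example and may not match the API of your installed Dynamo version.
from dynamo.sdk import depends, endpoint, service


@service(dynamo={"namespace": "hello-world"})
class Backend:
    """Final stage of the pipeline: produces the streamed response."""

    @endpoint()
    async def generate(self, text: str):
        # Stream one chunk per word of the incoming text.
        for word in text.split():
            yield f"Backend: {word}"


@service(dynamo={"namespace": "hello-world"})
class Middle:
    """Intermediate stage: annotates the request and forwards it."""

    backend = depends(Backend)

    @endpoint()
    async def generate(self, text: str):
        async for chunk in self.backend.generate(f"{text} via Middle"):
            yield chunk


@service(dynamo={"namespace": "hello-world"})
class Frontend:
    """Entry point: the service that clients talk to."""

    middle = depends(Middle)

    @endpoint()
    async def generate(self, text: str):
        async for chunk in self.middle.generate(text):
            yield chunk
```

In the actual example, a graph like this is deployed with Dynamo's CLI and exercised with a simple HTTP request; the Hello World page walks through the exact commands.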

Overview#

Dynamo is inference-engine agnostic (supporting TRT-LLM, vLLM, SGLang, and others) and provides LLM-specific capabilities such as:

  • Disaggregated prefill & decode inference - Maximizes GPU throughput and makes it easy to trade off throughput against latency.

  • Dynamic GPU scheduling - Optimizes performance based on fluctuating demand.

  • LLM-aware request routing - Eliminates unnecessary KV cache re-computation (a conceptual sketch follows this list).

  • Accelerated data transfer - Reduces inference response time using NIXL.

  • KV cache offloading - Leverages multiple tiers of the memory hierarchy for higher system throughput.
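
To make the routing bullet concrete, the snippet below is a conceptual sketch only, not Dynamo's router implementation; every name in it (`BLOCK_SIZE`, `block_hashes`, `Worker`, `route`) is hypothetical. It scores each worker by how long a prefix of the request's KV cache blocks that worker already holds and routes to the best match, which is the basic idea behind avoiding redundant KV cache re-computation.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)


def block_hashes(token_ids: list[int]) -> list[int]:
    """Split a prompt into fixed-size blocks and hash each block chained to its prefix."""
    hashes, prev = [], 0
    for i in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        prev = hash((prev, tuple(token_ids[i:i + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    cached_blocks: set[int] = field(default_factory=set)  # block hashes held in KV cache
    active_requests: int = 0


def route(prompt_tokens: list[int], workers: list[Worker]) -> Worker:
    """Pick the worker that can reuse the longest cached prefix, breaking ties by load."""
    blocks = block_hashes(prompt_tokens)

    def cached_prefix_len(w: Worker) -> int:
        n = 0
        for h in blocks:
            if h not in w.cached_blocks:
                break
            n += 1
        return n

    # Prefer maximal KV cache reuse; among equals, prefer the least-loaded worker.
    return max(workers, key=lambda w: (cached_prefix_len(w), -w.active_requests))
```

In Dynamo itself this bookkeeping is handled by the runtime rather than by application code; the sketch only captures the scoring idea behind KV-aware routing.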

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and is driven by a transparent development approach. Check out our repo at ai-dynamo/.