# Welcome to NVIDIA Dynamo
The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.
## Dive in: Examples
- A hello-world example that demonstrates the basic concepts of Dynamo by creating a simple multi-service pipeline (a framework-free sketch of this pattern follows the list).
- Examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
- A disaggregated-serving deployment across 3 nodes using nvidia/Llama-3.1-405B-Instruct-FP8.
- TensorRT-LLM examples and reference implementations for deploying LLMs in various configurations.
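To give a feel for the multi-service pipeline idea from the hello-world example, here is a minimal, framework-free `asyncio` sketch: a request flows through a chain of stages (`frontend` -> `middle` -> `backend`), each transforming a token stream and forwarding it to the next. The stage names are hypothetical illustrations of the pattern; this does not use the Dynamo SDK itself.

```python
import asyncio
from typing import AsyncIterator

# Hypothetical three-stage pipeline mirroring the hello-world structure:
# each stage transforms a stream of tokens and forwards it downstream.

async def backend(text: str) -> AsyncIterator[str]:
    # Final stage: emit the payload word by word, like a token stream.
    for token in text.split():
        await asyncio.sleep(0)  # yield control, as a real engine would
        yield f"{token}-back"

async def middle(text: str) -> AsyncIterator[str]:
    # Intermediate stage: decorate whatever the backend produces.
    async for token in backend(text):
        yield f"{token}-mid"

async def frontend(text: str) -> AsyncIterator[str]:
    # Entry point: forwards the request through the pipeline.
    async for token in middle(text):
        yield f"{token}-front"

async def main() -> None:
    async for token in frontend("hello world"):
        print(token)

asyncio.run(main())
```

In Dynamo proper, each stage would be an independently deployable service rather than a local coroutine, but the streaming composition is the same shape.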
## Overview
Dynamo is inference-engine agnostic, supporting TensorRT-LLM, vLLM, SGLang, and others, and captures LLM-specific capabilities such as:
- **Disaggregated prefill & decode inference** - Maximizes GPU throughput and lets you trade off throughput against latency.
- **Dynamic GPU scheduling** - Optimizes performance based on fluctuating demand.
- **LLM-aware request routing** - Eliminates unnecessary KV-cache recomputation (a conceptual sketch follows this list).
- **Accelerated data transfer** - Reduces inference response time using NIXL.
- **KV cache offloading** - Leverages multiple memory hierarchies for higher system throughput.
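To illustrate the routing idea, the sketch below sends each prompt to the worker whose cached KV blocks cover the longest prefix of that prompt, so prefill work is not repeated. This is a conceptual, framework-free illustration: `BLOCK_SIZE`, `to_blocks`, `best_worker`, and the worker names are all hypothetical and are not Dynamo's router API, which tracks cached blocks reported by the engines.

```python
# Conceptual KV-aware routing: pick the worker with the longest cached
# prefix for the incoming prompt. All names here are illustrative.

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

def to_blocks(tokens: list[int]) -> list[tuple[int, ...]]:
    """Split a token sequence into fixed-size blocks (partial tail dropped)."""
    return [
        tuple(tokens[i : i + BLOCK_SIZE])
        for i in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE)
    ]

def best_worker(prompt: list[int], caches: dict[str, set]) -> str:
    """Return the worker whose cache covers the longest prefix of the prompt."""
    def overlap(cached: set) -> int:
        hits = 0
        for block in to_blocks(prompt):
            if block not in cached:
                break  # prefix match only: stop at the first miss
            hits += 1
        return hits
    return max(caches, key=lambda w: overlap(caches[w]))

# Two hypothetical workers: worker-a has already served this prompt's prefix.
prompt = list(range(64))
caches = {
    "worker-a": set(to_blocks(list(range(48)))),
    "worker-b": set(),
}
print(best_worker(prompt, caches))  # -> worker-a
```

A production router also weighs load and cache freshness, but prefix overlap is the core signal that avoids KV-cache recomputation.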
Dynamo is built in Rust for performance and in Python for extensibility. It is fully open source, with a transparent development approach; check out our repo at ai-dynamo/.