# Component Layers

Each layer contains components that can be adopted independently or combined with others. Components within a layer often complement each other but do not require the full layer to function.

## Infrastructure Layer

Dynamic Resource Allocation (DRA) enables flexible GPU sharing in Kubernetes clusters. DRA ComputeDomains are a construct for managing distributed shared memory via IMEX. Together with advanced topology-aware schedulers, DRA powers predictable multi-node disaggregated model inference at production (rack) scale.

| Component | Purpose |
|---|---|
| NVIDIA IMEX | A facility that allows a set of GPUs to directly read and write each other's memory over high-bandwidth NVLink. The connection may be either directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch. |
| NVIDIA NVLink | Scale-up networking fabric providing high-bandwidth GPU-to-GPU communication for AI training, inference, and other demanding rack-scale GPU-accelerated workloads. |
| Dynamic Resource Allocation (DRA) | Controlled sharing and dynamic reconfiguration of GPUs. Enables ComputeDomains, which provide robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. |
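As an illustrative sketch of how a ComputeDomain is declared, the manifest below follows the shape of NVIDIA's DRA GPU driver CRD; the `apiVersion`, field names, and values here are assumptions and may differ by driver version, so consult the driver's documentation for the exact schema.

```yaml
# Hypothetical sketch: apiVersion, kind, and field names are assumptions;
# check the NVIDIA DRA driver docs for the exact schema in your version.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-domain
spec:
  numNodes: 4                 # number of nodes sharing the IMEX domain
  channel:
    resourceClaimTemplate:
      name: imex-channel-0    # claim template that pods reference to join the domain
```

Pods that reference the generated resource claim are then scheduled into the same MNNVL partition, giving their GPUs IMEX-backed access to each other's memory across nodes.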

## Optimization Layer

Components that prepare and optimize models for maximum inference performance. Use any combination based on your model types and requirements.

| Component | Purpose |
|---|---|
| Model Optimizer | State-of-the-art optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, that produce more efficient model checkpoints for deployment on GPUs. |
| TensorRT | High-performance deep learning inference runtime for traditional deep learning models. |
| TensorRT-LLM | Purpose-built runtime with state-of-the-art inference optimizations for GenAI models deployed on NVIDIA GPUs. |

## Deployment Layer

Cloud-native orchestration and infrastructure management. Start with GPU Operator for basic GPU management, then add components as scaling requirements grow. When deploying Dynamo with NVIDIA's Helm charts, you can optionally install Grove and KAI Scheduler from the same chart.

| Component | Purpose |
|---|---|
| KAI Scheduler | Kubernetes scheduler optimized for GPU resource allocation in AI/ML workloads. |
| GPU Operator | Manages GPU nodes in Kubernetes clusters as easily as CPU nodes. |
| Network Operator | Enables RDMA and GPUDirect networking for high-performance workloads. |
| Grove | Kubernetes API for defining gang scheduling and topology awareness for multinode disaggregated serving. |
| Planner | Real-time performance tuning for prefill and decode under SLA constraints (time to first token, TTFT, and inter-token latency, ITL). |
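A minimal sketch of the optional co-install described above: the release name, chart reference, and value keys below are assumptions for illustration only, so check the chart's `values.yaml` for the actual toggle names before use.

```shell
# Hypothetical sketch: chart reference and --set keys are assumptions;
# verify against the chart's values.yaml.
helm install dynamo-platform <dynamo-chart> \
  --namespace dynamo --create-namespace \
  --set grove.enabled=true \
  --set kai-scheduler.enabled=true
```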

## Inference Serving Layer

The runtime engines that handle inference requests. Choose Triton for traditional ML workloads, Dynamo for GenAI, or both for mixed environments.

| Component | Purpose |
|---|---|
| Dynamo | Purpose-built inference serving for GenAI workloads with disaggregated-architecture support. |
| Router | Smart request routing that leverages KV cache hit rate and load balancing, with extensible algorithms for agentic workloads. |
| Gateway API Inference Extension | Integrates the Kubernetes inference gateway with Dynamo. |

## Memory and Caching Layer

High-performance memory management and data transfer components. These unlock advanced capabilities like disaggregated serving and fast model loading when needed.

| Component | Purpose |
|---|---|
| NIXL | Point-to-point data transfer between GPUs and G1-G4 memory/storage tiers. |
| KV Block Manager | Manages KV cache across memory tiers with customizable eviction policies. |
| Model Express | Fast model-weight loading via caching and NIXL transfer; also leveraged for fault tolerance. |

## Performance Tooling

Tools for benchmarking and configuration. Use these independently or together to optimize your deployment.

| Component | Purpose |
|---|---|
| AIConfigurator | Performance estimation for aggregated versus disaggregated serving based on model, ISL/OSL, and hardware. |
| AIPerf | Distributed benchmarking tool (Python-based, highly extensible). |

## Container Registry

| Component | Purpose |
|---|---|
| nvcr.io | GPU-optimized containers, pretrained models, SDKs, and Helm charts for cloud-to-edge deployment. |