Component Layers
Each layer contains components that can be adopted independently or combined with others. Components within a layer often complement each other but do not require the full layer to function.
Infrastructure Layer
Dynamic Resource Allocation (DRA) enables flexible GPU sharing in Kubernetes clusters. DRA ComputeDomains are a construct for managing distributed shared memory (IMEX). Together with advanced topology-aware schedulers, DRA powers predictable multi-node disaggregated model inference at production (rack) scale.
| Component | Purpose |
|---|---|
| NVLink | A facility that allows a set of GPUs to directly read and write each other's memory over high-bandwidth NVLink. The connection may be either directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch. |
| NVLink Switch | Scale-up networking fabric providing high-bandwidth GPU-to-GPU communication for AI training, inference, and other demanding rack-scale GPU-accelerated workloads. |
| Dynamic Resource Allocation (DRA) | Controlled sharing and dynamic reconfiguration of GPUs. Enables ComputeDomains, which provide robust and secure Multi-Node NVLink (MNNVL) on NVIDIA GB200 and similar systems. |
Optimization Layer
Components that prepare and optimize models for maximum inference performance. Use any combination based on your model types and requirements.
| Component | Purpose |
|---|---|
| TensorRT Model Optimizer | State-of-the-art optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, that produce more efficient model checkpoints to deploy on GPUs. |
| TensorRT | High-performance deep learning inference runtime for traditional deep learning models. |
| TensorRT-LLM | Purpose-built runtime with state-of-the-art inference optimizations for GenAI models deployed on NVIDIA GPUs. |
Deployment Layer
Cloud-native orchestration and infrastructure management. Start with GPU Operator for basic GPU management, then add components as scaling requirements grow. When deploying Dynamo with NVIDIA’s Helm charts, you can optionally install Grove and KAI Scheduler in the same chart.
| Component | Purpose |
|---|---|
| KAI Scheduler | Kubernetes scheduler optimized for GPU resource allocation in AI/ML workloads. |
| GPU Operator | Manages GPU nodes in Kubernetes clusters as straightforwardly as CPU nodes. |
| Network Operator | Enables RDMA and GPUDirect networking for high-performance workloads. |
| Grove | Kubernetes API for defining gang scheduling and topology awareness for multinode disaggregated serving. |
| Planner | Real-time performance tuning for prefill and decode under SLA constraints (TTFT and ITL). |
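As a rough illustration of the two SLA metrics a planner-style tuning loop works against (a toy sketch with hypothetical names, not the Planner's actual API), checking measured latencies against TTFT and ITL targets might look like:

```python
# Illustrative sketch only: TTFT = time to first token, ITL = inter-token
# latency. All class and function names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Slo:
    ttft_ms: float  # maximum acceptable time to first token
    itl_ms: float   # maximum acceptable inter-token latency

@dataclass
class Measurement:
    ttft_ms: float
    itl_ms: float

def meets_slo(m: Measurement, slo: Slo) -> bool:
    """True when both latency targets are satisfied."""
    return m.ttft_ms <= slo.ttft_ms and m.itl_ms <= slo.itl_ms

def scale_hint(m: Measurement, slo: Slo) -> str:
    """Toy policy: TTFT violations point at prefill capacity,
    ITL violations at decode capacity."""
    if m.ttft_ms > slo.ttft_ms:
        return "add prefill capacity"
    if m.itl_ms > slo.itl_ms:
        return "add decode capacity"
    return "steady"
```

The split mirrors disaggregated serving: TTFT is dominated by the prefill phase and ITL by the decode phase, so the two can be scaled independently.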
Inference Serving Layer
The runtime engines that handle inference requests. Choose Triton for traditional ML workloads, Dynamo for GenAI, or both for mixed environments.
| Component | Purpose |
|---|---|
| Dynamo | Purpose-built inference serving for GenAI workloads with support for disaggregated architectures. |
| KV Router | Smart request routing that leverages KV cache hit rates and load balancing, with extensible algorithms for agentic workloads. |
| Inference Gateway | Integration of the Kubernetes Inference Gateway with Dynamo. |
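To illustrate the idea behind KV-cache-aware routing (a toy sketch under assumed names, not Dynamo's router implementation): prefer the worker whose cached token prefix overlaps the incoming request most, penalized by its current load.

```python
# Illustrative only: real routers track cached KV blocks, not raw token
# lists, and use tuned scoring. Names and weights here are hypothetical.

def prefix_overlap(tokens: list[int], cached: list[int]) -> int:
    """Length of the shared prefix between a request and a worker's cache."""
    n = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        n += 1
    return n

def pick_worker(tokens: list[int], workers: dict,
                overlap_weight: float = 1.0, load_weight: float = 2.0) -> str:
    """workers maps name -> (cached_tokens, active_request_count).
    Higher overlap raises the score; higher load lowers it."""
    def score(item):
        name, (cached, load) = item
        return overlap_weight * prefix_overlap(tokens, cached) - load_weight * load
    return max(workers.items(), key=score)[0]
```

A longer shared prefix means more of the request's KV cache is already resident, skipping redundant prefill; the load penalty keeps hot workers from being overwhelmed.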
Memory and Caching Layer
High-performance memory management and data transfer components. These unlock advanced capabilities like disaggregated serving and fast model loading when needed.
| Component | Purpose |
|---|---|
| NIXL | Point-to-point data transfer between GPUs and G1-G4 memory/storage tiers. |
| KVBM | Manages KV cache across memory tiers with customizable eviction policies. |
| Model Express | Fast model weight loading via caching and NIXL transfer; also leveraged for fault tolerance. |
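The tiering idea behind KV cache offloading can be sketched in a few lines (illustrative only; KVBM's real data structures, policies, and APIs differ): when a block is evicted from the hot tier, it spills to a colder tier instead of being dropped.

```python
# Toy two-tier LRU cache: "gpu" stands in for a hot tier (e.g. HBM),
# "host" for a colder tier (e.g. host DRAM). All names are hypothetical.
from collections import OrderedDict

class TieredCache:
    def __init__(self, gpu_slots: int, host_slots: int):
        self.gpu = OrderedDict()   # hot tier, most-recently-used at the end
        self.host = OrderedDict()  # cold tier
        self.gpu_slots = gpu_slots
        self.host_slots = host_slots

    def put(self, key, block):
        self.gpu[key] = block
        self.gpu.move_to_end(key)
        if len(self.gpu) > self.gpu_slots:
            old_key, old_block = self.gpu.popitem(last=False)  # evict LRU block
            self.host[old_key] = old_block                     # spill, don't drop
            if len(self.host) > self.host_slots:
                self.host.popitem(last=False)  # cold tier full: discard LRU

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:
            self.put(key, self.host.pop(key))  # promote back to hot tier
            return self.gpu[key]
        return None
```

Spilling rather than discarding is what lets a later request with the same prefix skip recomputation at the cost of a cheaper tier-to-tier copy.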
Performance Tooling
Tools for benchmarking and configuration. Use these independently or together to optimize your deployment.
| Component | Purpose |
|---|---|
| AIConfigurator | Performance estimation for aggregated vs. disaggregated serving based on model, ISL/OSL, and hardware. |
| AIPerf | Distributed benchmarking tool (Python-based and highly extensible). |
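Such estimates hinge on how the request shape (ISL/OSL) splits GPU time between prefill and decode. A back-of-envelope sketch of that split (a toy model with hypothetical names and rates, not the estimator's actual methodology):

```python
# Toy cost model, illustrative only: per-request GPU time is prefill time
# over ISL (input sequence length) plus decode time over OSL (output
# sequence length). The token rates are made-up inputs, not measurements.

def request_gpu_seconds(isl: int, osl: int,
                        prefill_tok_per_s: float,
                        decode_tok_per_s: float) -> float:
    """Total GPU-seconds one request consumes in this toy model."""
    return isl / prefill_tok_per_s + osl / decode_tok_per_s

def prefill_fraction(isl: int, osl: int,
                     prefill_tok_per_s: float,
                     decode_tok_per_s: float) -> float:
    """Share of GPU time spent in prefill. Prefill-heavy mixes are the
    cases where separate prefill workers are typically weighed."""
    prefill = isl / prefill_tok_per_s
    decode = osl / decode_tok_per_s
    return prefill / (prefill + decode)
```

Because prefill is throughput-oriented and decode is latency-oriented, the prefill share of GPU time is one of the signals that distinguishes workloads suited to aggregated serving from those suited to disaggregation.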
Container Registry
| Component | Purpose |
|---|---|
| NGC Catalog | GPU-optimized containers, pretrained models, SDKs, and Helm charts for cloud-to-edge deployment. |