Component Layers

Each layer contains components that can be adopted independently or combined with others. Components within a layer often complement each other but do not require the full layer to function.

Infrastructure Layer

Dynamic Resource Allocation (DRA) enables flexible GPU sharing in Kubernetes clusters. DRA ComputeDomains provide a construct for managing distributed shared GPU memory via IMEX. Combined with topology-aware schedulers, DRA enables predictable multi-node disaggregated model inference at production (rack) scale.

| Component | Purpose |
| --- | --- |
| NVIDIA IMEX | A facility that allows a set of GPUs to directly read and write each other's memory over high-bandwidth NVLink. The connection may be either directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch. |
| NVIDIA NVLink | Scale-up networking fabric providing high-bandwidth GPU-to-GPU communication for AI training, inference, and other demanding rack-scale GPU-accelerated workloads. |
| Dynamic Resource Allocation (DRA) | Controlled sharing and dynamic reconfiguration of GPUs. Enables ComputeDomains, which provide robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. |
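To illustrate how ComputeDomains are requested, a minimal manifest might look like the following. This is a sketch based on the NVIDIA DRA driver's ComputeDomain custom resource; the `apiVersion`, field names, and channel structure vary by driver release, so treat them as assumptions and verify against the CRD installed in your cluster:

```yaml
# Hedged sketch: field names follow the NVIDIA DRA GPU driver's ComputeDomain
# CRD at time of writing and may differ in your driver version.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-compute-domain
spec:
  # Number of nodes whose GPUs participate in the shared IMEX domain.
  numNodes: 2
  channel:
    resourceClaimTemplate:
      # Pods reference this claim template to join the domain's IMEX channel.
      name: example-compute-domain-channel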

Optimization Layer

Components that prepare and optimize models for maximum inference performance. Use any combination based on your model types and requirements.

| Component | Purpose |
| --- | --- |
| Model Optimizer | State-of-the-art optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, for producing more efficient model checkpoints to deploy on GPUs. |
| TensorRT | High-performance deep learning inference runtime for traditional deep learning models. |
| TensorRT-LLM | Purpose-built runtime with state-of-the-art inference optimizations for GenAI models deployed on NVIDIA GPUs. |

Deployment Layer

Cloud-native orchestration and infrastructure management. Start with GPU Operator for basic GPU management, then add components as scaling requirements grow. When deploying Dynamo with NVIDIA’s Helm charts, you can optionally install Grove and KAI Scheduler in the same chart.

| Component | Purpose |
| --- | --- |
| KAI Scheduler | Kubernetes scheduler optimized for GPU resource allocation in AI/ML workloads. |
| GPU Operator | Manages GPU nodes in Kubernetes clusters as easily as CPU nodes. |
| Network Operator | Enables RDMA and GPUDirect networking for high-performance workloads. |
| Grove | Kubernetes API for defining gang scheduling and topology awareness for multi-node disaggregated serving. |
| Planner | Real-time performance tuning for prefill and decode under SLA constraints: time to first token (TTFT) and inter-token latency (ITL). |
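As a sketch of the optional co-install mentioned above, a Helm invocation might look like the following. The chart reference and value keys here are hypothetical placeholders; consult the `values.yaml` of the Dynamo chart release you are using for the actual names:

```shell
# Hypothetical sketch: <dynamo-chart-ref> and the --set keys are placeholders,
# not verified flag names. Check the chart's values.yaml for the real keys.
helm install dynamo-platform <dynamo-chart-ref> \
  --namespace dynamo-system --create-namespace \
  --set grove.enabled=true \
  --set kai-scheduler.enabled=true
```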

Inference Serving Layer

The runtime engines that handle inference requests. Choose Triton for traditional ML workloads, Dynamo for GenAI, or both for mixed environments.

| Component | Purpose |
| --- | --- |
| Dynamo | Purpose-built inference serving for GenAI workloads with disaggregated-architecture support. |
| Router | Smart request routing that leverages KV cache hit rate and load balancing, with extensible algorithms for agentic workloads. |
| Gateway API Inference Extension | Integrates the Kubernetes inference gateway with Dynamo. |
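To make the KV-cache-aware routing idea concrete, here is a minimal, hypothetical sketch, not Dynamo's actual Router implementation or API, that scores workers by estimated prefix-cache overlap and current load:

```python
# Toy KV-cache-aware routing sketch (hypothetical; not Dynamo's Router API).
# Each worker advertises which token blocks it has cached; the router prefers
# workers with a high expected prefix-cache hit rate and low current load.

def score(worker, request_blocks, cache_weight=1.0, load_weight=0.5):
    """Higher is better: reward cached-prefix overlap, penalize load."""
    overlap = 0
    for block in request_blocks:       # count contiguous leading matches
        if block in worker["cached_blocks"]:
            overlap += 1
        else:
            break                      # a prefix match must be contiguous
    hit_rate = overlap / max(len(request_blocks), 1)
    return cache_weight * hit_rate - load_weight * worker["load"]

def route(workers, request_blocks):
    """Pick the worker with the best combined cache/load score."""
    return max(workers, key=lambda w: score(w, request_blocks))

workers = [
    {"name": "w0", "cached_blocks": {"b1", "b2"}, "load": 0.9},
    {"name": "w1", "cached_blocks": {"b1", "b2", "b3"}, "load": 0.2},
    {"name": "w2", "cached_blocks": set(), "load": 0.0},
]
print(route(workers, ["b1", "b2", "b3"])["name"])  # w1: full prefix hit, low load
```

A production router would also account for block size, cache staleness, and queue depth, but the core trade-off (reuse cached KV blocks versus balance load) is the same.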

Memory and Caching Layer

High-performance memory management and data transfer components. These unlock advanced capabilities like disaggregated serving and fast model loading when needed.

| Component | Purpose |
| --- | --- |
| NIXL | Point-to-point data transfer between GPUs and G1-G4 memory/storage tiers. |
| KV Block Manager | Manages KV cache across memory tiers with customizable eviction policies. |
| Model Express | Fast model-weight loading via caching and NIXL transfer; also leveraged for fault tolerance. |
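A customizable eviction policy can be illustrated with a toy fixed-capacity block tier. This is a hypothetical sketch, not the KV Block Manager's actual interface, showing least-recently-used (LRU) eviction, one common policy choice:

```python
# Toy KV-block tier with LRU eviction (illustrative only; not the real
# KV Block Manager API). Capacity is measured in blocks, not bytes.
from collections import OrderedDict

class ToyKVBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_id -> payload, oldest first

    def get(self, block_id):
        """Return a cached block's payload, marking it recently used."""
        if block_id not in self.blocks:
            return None
        self.blocks.move_to_end(block_id)
        return self.blocks[block_id]

    def put(self, block_id, payload):
        """Insert a block, evicting the LRU block if the tier is full.
        Returns the evicted block_id, or None."""
        evicted = None
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
        elif len(self.blocks) >= self.capacity:
            evicted, _ = self.blocks.popitem(last=False)  # drop LRU entry
        self.blocks[block_id] = payload
        return evicted

cache = ToyKVBlockCache(capacity=2)
cache.put("b1", "kv-data-1")
cache.put("b2", "kv-data-2")
cache.get("b1")                       # touch b1 so b2 becomes LRU
print(cache.put("b3", "kv-data-3"))   # prints: b2 (evicted)
```

In a tiered setup, an "evicted" block would typically be demoted to a slower tier (CPU memory or storage) rather than discarded, so it can be recalled later without recomputation.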

Performance Tooling

Tools for benchmarking and configuration. Use these independently or together to optimize your deployment.

| Component | Purpose |
| --- | --- |
| AIConfigurator | Performance estimation for aggregated versus disaggregated serving based on model, input/output sequence lengths (ISL/OSL), and hardware. |
| AIPerf | Distributed benchmarking tool (Python-based, highly extensible). |

Container Registry

| Component | Purpose |
| --- | --- |
| nvcr.io | GPU-optimized containers, pretrained models, SDKs, and Helm charts for cloud-to-edge deployment. |