Component Layers
Each layer contains components that can be adopted independently or combined with others. Components within a layer often complement each other but do not require the full layer to function.
Infrastructure Layer
Dynamic Resource Allocation (DRA) enables flexible GPU sharing in Kubernetes clusters. DRA ComputeDomains are a construct for managing distributed shared memory (IMEX). Together with advanced topology-aware schedulers, DRA powers predictable multi-node disaggregated model inference at production (rack) scale.
| Component | Purpose |
|---|---|
| NVLink | A facility that allows a set of GPUs to directly read and write each other's memory over high-bandwidth NVLink. The connection may be either directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch. |
| NVLink Switch | Scale-up networking fabric providing high-bandwidth GPU-to-GPU communication for AI training, inference, and other demanding rack-scale GPU-accelerated workloads. |
| Dynamic Resource Allocation (DRA) | Controlled sharing and dynamic reconfiguration of GPUs. Enables ComputeDomains, which provide robust and secure Multi-Node NVLink (MNNVL) on NVIDIA GB200 and similar systems. |
Optimization Layer
Components that prepare and optimize models for maximum inference performance. Use any combination based on your model types and requirements.
| Component | Purpose |
|---|---|
| TensorRT Model Optimizer | State-of-the-art optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, that produce more efficient model checkpoints to deploy on GPUs. |
| TensorRT | High-performance deep learning inference runtime for traditional deep learning models. |
| TensorRT-LLM | Purpose-built runtime with state-of-the-art inference optimizations for GenAI models deployed on NVIDIA GPUs. |
Deployment Layer
Cloud-native orchestration and infrastructure management. Start with GPU Operator for basic GPU management, then add components as scaling requirements grow. When deploying Dynamo with NVIDIA’s Helm charts, you can optionally install Grove and KAI Scheduler in the same chart.
| Component | Purpose |
|---|---|
| KAI Scheduler | Kubernetes scheduler optimized for GPU resource allocation in AI/ML workloads. |
| GPU Operator | Manages GPU nodes in Kubernetes clusters as straightforwardly as CPU nodes. |
| Network Operator | Enables RDMA and GPUDirect networking for high-performance workloads. |
| Grove | Kubernetes API for defining gang scheduling and topology awareness for multinode disaggregated serving. |
| Planner | Real-time performance tuning for prefill and decode under SLA constraints (TTFT and ITL). |
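As a rough illustration of the two SLA metrics a planner-style tuning loop works against (a toy sketch with hypothetical names, not the Planner's actual API), checking measured latencies against TTFT and ITL targets might look like:

```python
# Illustrative sketch only: TTFT = time to first token, ITL = inter-token
# latency. All class and function names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Slo:
    ttft_ms: float  # maximum acceptable time to first token
    itl_ms: float   # maximum acceptable inter-token latency

@dataclass
class Measurement:
    ttft_ms: float
    itl_ms: float

def meets_slo(m: Measurement, slo: Slo) -> bool:
    """True when both latency targets are satisfied."""
    return m.ttft_ms <= slo.ttft_ms and m.itl_ms <= slo.itl_ms

def scale_hint(m: Measurement, slo: Slo) -> str:
    """Toy policy: TTFT violations point at prefill capacity,
    ITL violations at decode capacity."""
    if m.ttft_ms > slo.ttft_ms:
        return "add prefill capacity"
    if m.itl_ms > slo.itl_ms:
        return "add decode capacity"
    return "steady"
```

The split mirrors disaggregated serving: TTFT is dominated by the prefill phase and ITL by the decode phase, so the two can be scaled independently.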
Inference Serving Layer
The runtime engines that handle inference requests. Choose Triton for traditional ML workloads, Dynamo for GenAI, or both for mixed environments.
| Component | Purpose |
|---|---|
| Dynamo | Purpose-built inference serving for GenAI workloads with support for disaggregated architectures. |
| KV Router | Smart request routing that leverages KV cache hit rates and load balancing, with extensible algorithms for agentic workloads. |
| Inference Gateway | Integration of the Kubernetes Inference Gateway with Dynamo. |
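To illustrate the idea behind KV-cache-aware routing (a toy sketch under assumed names, not Dynamo's router implementation): prefer the worker whose cached token prefix overlaps the incoming request most, penalized by its current load.

```python
# Illustrative only: real routers track cached KV blocks, not raw token
# lists, and use tuned scoring. Names and weights here are hypothetical.

def prefix_overlap(tokens: list[int], cached: list[int]) -> int:
    """Length of the shared prefix between a request and a worker's cache."""
    n = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        n += 1
    return n

def pick_worker(tokens: list[int], workers: dict,
                overlap_weight: float = 1.0, load_weight: float = 2.0) -> str:
    """workers maps name -> (cached_tokens, active_request_count).
    Higher overlap raises the score; higher load lowers it."""
    def score(item):
        name, (cached, load) = item
        return overlap_weight * prefix_overlap(tokens, cached) - load_weight * load
    return max(workers.items(), key=score)[0]
```

A longer shared prefix means more of the request's KV cache is already resident, skipping redundant prefill; the load penalty keeps hot workers from being overwhelmed.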
Memory and Caching Layer
High-performance memory management and data transfer components. These unlock advanced capabilities like disaggregated serving and fast model loading when needed.
| Component | Purpose |
|---|---|
| NIXL | Point-to-point data transfer between GPUs and G1-G4 memory/storage tiers. |
| KVBM | Manages KV cache across memory tiers with customizable eviction policies. |
| Model Express | Fast model weight loading via caching and NIXL transfer; also leveraged for fault tolerance. |
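The tiering idea behind KV cache offloading can be sketched in a few lines (illustrative only; KVBM's real data structures, policies, and APIs differ): when a block is evicted from the hot tier, it spills to a colder tier instead of being dropped.

```python
# Toy two-tier LRU cache: "gpu" stands in for a hot tier (e.g. HBM),
# "host" for a colder tier (e.g. host DRAM). All names are hypothetical.
from collections import OrderedDict

class TieredCache:
    def __init__(self, gpu_slots: int, host_slots: int):
        self.gpu = OrderedDict()   # hot tier, most-recently-used at the end
        self.host = OrderedDict()  # cold tier
        self.gpu_slots = gpu_slots
        self.host_slots = host_slots

    def put(self, key, block):
        self.gpu[key] = block
        self.gpu.move_to_end(key)
        if len(self.gpu) > self.gpu_slots:
            old_key, old_block = self.gpu.popitem(last=False)  # evict LRU block
            self.host[old_key] = old_block                     # spill, don't drop
            if len(self.host) > self.host_slots:
                self.host.popitem(last=False)  # cold tier full: discard LRU

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:
            self.put(key, self.host.pop(key))  # promote back to hot tier
            return self.gpu[key]
        return None
```

Spilling rather than discarding is what lets a later request with the same prefix skip recomputation at the cost of a cheaper tier-to-tier copy.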
Performance Tooling
Tools for benchmarking and configuration. Use these independently or together to optimize your deployment.
| Component | Purpose |
|---|---|
| AIConfigurator | Performance estimation for aggregated vs. disaggregated serving based on model, ISL/OSL, and hardware. |
| AIPerf | Distributed benchmarking tool (Python-based and highly extensible). |
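Such estimates hinge on how the request shape (ISL/OSL) splits GPU time between prefill and decode. A back-of-envelope sketch of that split (a toy model with hypothetical names and rates, not the estimator's actual methodology):

```python
# Toy cost model, illustrative only: per-request GPU time is prefill time
# over ISL (input sequence length) plus decode time over OSL (output
# sequence length). The token rates are made-up inputs, not measurements.

def request_gpu_seconds(isl: int, osl: int,
                        prefill_tok_per_s: float,
                        decode_tok_per_s: float) -> float:
    """Total GPU-seconds one request consumes in this toy model."""
    return isl / prefill_tok_per_s + osl / decode_tok_per_s

def prefill_fraction(isl: int, osl: int,
                     prefill_tok_per_s: float,
                     decode_tok_per_s: float) -> float:
    """Share of GPU time spent in prefill. Prefill-heavy mixes are the
    cases where separate prefill workers are typically weighed."""
    prefill = isl / prefill_tok_per_s
    decode = osl / decode_tok_per_s
    return prefill / (prefill + decode)
```

Because prefill is throughput-oriented and decode is latency-oriented, the prefill share of GPU time is one of the signals that distinguishes workloads suited to aggregated serving from those suited to disaggregation.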
Container Registry
| Component | Purpose |
|---|---|
| NGC Catalog | GPU-optimized containers, pretrained models, SDKs, and Helm charts for cloud-to-edge deployment. |