
# Component Layers

Each layer contains components that can be adopted independently or combined with others. Components within a layer often complement each other but do not require the full layer to function.

## Infrastructure Layer

Dynamic Resource Allocation (DRA) enables flexible GPU sharing in Kubernetes clusters. ComputeDomains are a DRA construct for managing GPU memory shared across nodes through NVIDIA IMEX (see the sketch after the table below). Together with advanced topology-aware schedulers, DRA powers predictable multi-node disaggregated model inference at production (rack) scale.

| Component                                                                                                                       | Purpose                                                                                                                                                                                                                                         |
| ------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [NVIDIA IMEX](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html)                                       | A facility that allows a set of GPUs to directly read and write each other's memory over high-bandwidth NVLink. The connection may be either directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch. |
| [NVIDIA NVLink](https://www.nvidia.com/en-us/data-center/nvlink/)                                                               | Scale-up networking fabric with high-bandwidth GPU-to-GPU communications for AI training, inference, and other demanding rack-scale GPU-accelerated workloads.                                                                                  |
| [Dynamic Resource Allocation (DRA)](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html) | Controlled sharing and dynamic reconfiguration of GPUs. Provides ComputeDomains, which enable robust and secure Multi-Node NVLink (MNNVL) on NVIDIA GB200 and similar systems.                                                                  |
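
The ComputeDomain construct can be exercised directly through the Kubernetes API. Below is a minimal sketch using the official Kubernetes Python client; the `resource.nvidia.com/v1beta1` group/version and the `numNodes`/`channel` spec fields follow the DRA driver examples and may differ between driver releases, so verify them against the DRA documentation linked above.

```python
# Sketch: create a DRA ComputeDomain with the Kubernetes Python client.
# The group/version and spec fields are assumptions based on the NVIDIA
# DRA driver examples; check your driver release for the exact schema.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

compute_domain = {
    "apiVersion": "resource.nvidia.com/v1beta1",
    "kind": "ComputeDomain",
    "metadata": {"name": "demo-compute-domain"},
    "spec": {
        "numNodes": 2,  # nodes whose GPUs join the shared IMEX domain
        "channel": {
            # Workload pods reference this claim template to request
            # placement inside the ComputeDomain.
            "resourceClaimTemplate": {"name": "demo-compute-domain-channel"},
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="resource.nvidia.com",
    version="v1beta1",
    namespace="default",
    plural="computedomains",
    body=compute_domain,
)
```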

## Optimization Layer

Components that prepare and optimize models for maximum inference performance. Use any combination based on your model types and requirements.

| Component                                                             | Purpose                                                                                                                                                                            |
| --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) | State-of-the-art optimization techniques including quantization, distillation, pruning, speculative decoding, and sparsity for more efficient model checkpoints to deploy on GPUs. |
| [TensorRT](https://github.com/NVIDIA/TensorRT)                        | High-performance inference runtime for traditional deep learning models.                                                                                                          |
| [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)                | Purpose-built runtime with state-of-the-art inference optimizations for GenAI models deployed on NVIDIA GPUs.                                                                      |
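
To make the optimization flow concrete, here is a minimal sketch of post-training quantization with Model Optimizer on a toy PyTorch model. The `mtq.quantize` entry point and the `FP8_DEFAULT_CFG` config name follow the Model Optimizer examples, but available configs vary by release, so treat them as assumptions and check the repository linked above.

```python
# Sketch: post-training quantization with TensorRT Model Optimizer
# (pip install nvidia-modelopt). Config names vary by release.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-in for a real model checkpoint.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# A handful of representative batches for calibration.
calibration_batches = [torch.randn(8, 16) for _ in range(4)]

def forward_loop(m: nn.Module) -> None:
    # Run calibration data through the model so the quantizer can
    # observe activation ranges.
    with torch.no_grad():
        for batch in calibration_batches:
            m(batch)

# Inserts quantizers and calibrates them; the resulting checkpoint can
# then be exported for a deployment runtime such as TensorRT-LLM.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```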

## Deployment Layer

Cloud-native orchestration and infrastructure management. Start with GPU Operator for basic GPU management, then add components as scaling requirements grow. When deploying Dynamo with NVIDIA's Helm charts, you can [optionally install Grove and KAI Scheduler](https://docs.nvidia.com/dynamo/kubernetes-deployment/deployment-guide/detailed-installation-guide#path-a-production-install) in the same chart.

| Component                                                                                      | Purpose                                                                                         |
| ---------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler/blob/main/README.md)                   | Kubernetes scheduler optimized for GPU resource allocation in AI/ML workloads                   |
| [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) | Automates provisioning of NVIDIA software (driver, container toolkit, device plugin) so GPU nodes in Kubernetes clusters can be managed like CPU nodes |
| [Network Operator](https://docs.nvidia.com/networking/display/kubernetes25100/index.html)      | Enables RDMA and GPUDirect networking for high-performance workloads                            |
| [Grove](https://developer.nvidia.com/grove)                                                    | Kubernetes API for defining gang scheduling and topology awareness for multi-node disaggregated serving |
| [Planner](https://docs.nvidia.com/dynamo/latest/planner/planner_intro.html)                    | Real-time performance tuning for prefill and decode under SLA constraints such as time to first token (TTFT) and inter-token latency (ITL) |
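
To show how a workload opts in to KAI Scheduler, the sketch below submits a pod through the Kubernetes Python client. The `kai-scheduler` scheduler name and the `kai.scheduler/queue` label follow the KAI Scheduler quick start; the queue name and container image are placeholders to adapt for your cluster.

```python
# Sketch: route a GPU pod to KAI Scheduler. The queue label and image
# are placeholders; verify both against the KAI Scheduler README.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="gpu-smoke-test",
        labels={"kai.scheduler/queue": "test"},  # assumed queue name
    ),
    spec=client.V1PodSpec(
        scheduler_name="kai-scheduler",  # hand scheduling to KAI
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```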

## Inference Serving Layer

The runtime engines that handle inference requests. Choose Triton for traditional ML workloads, Dynamo for GenAI, or both for mixed environments.

| Component                                                                                                           | Purpose                                                                                                                 |
| ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| [Triton Inference Server](https://github.com/triton-inference-server/server)                                        | Inference serving for traditional ML models across frameworks such as TensorRT, PyTorch, and ONNX Runtime               |
| [Dynamo](https://docs.nvidia.com/dynamo/latest/index.html)                                                          | Purpose-built inference serving for GenAI workloads with disaggregated architecture support                             |
| [Router](https://docs.nvidia.com/dynamo/latest/router/README.html)                                                  | Smart request routing leveraging KV cache hit rate and load balancing, with extensible algorithms for agentic workloads |
| [Gateway API Inference Extension](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/README.md) | Integrates Dynamo with the Kubernetes Gateway API inference gateway                                                     |
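
Because the Dynamo frontend exposes an OpenAI-compatible HTTP API, a deployment can be smoke-tested with a plain HTTP request. The sketch below assumes a frontend listening on `localhost:8000` and uses a placeholder model name.

```python
# Sketch: query a Dynamo deployment through its OpenAI-compatible
# endpoint. URL, port, and model name are placeholders for your setup.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-deployed-model",  # placeholder model id
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```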

## Memory and Caching Layer

High-performance memory management and data transfer components. These unlock advanced capabilities like disaggregated serving and fast model loading when needed.

| Component                                                                      | Purpose                                                                                     |
| ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------- |
| [NIXL](https://docs.nvidia.com/dynamo/latest/api/nixl_connect/README.html)     | Point-to-point data transfer between GPUs and G1-G4 memory/storage tiers                    |
| [KV Block Manager](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html) | Manages KV cache across memory tiers with customizable eviction policies                    |
| [Model Express](https://github.com/ai-dynamo/modelexpress/blob/main/README.md) | Fast model weight loading via caching and NIXL transfer, also leveraged for fault tolerance |

## Performance Tooling

Tools for benchmarking and configuration. Use these independently or together to optimize your deployment.

| Component                                                     | Purpose                                                                                              |
| ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator) | Performance estimation for aggregated vs. disaggregated serving based on model, input/output sequence lengths (ISL/OSL), and hardware |
| [AIPerf](https://github.com/ai-dynamo/aiperf)                 | Distributed benchmarking tool (Python-based, highly extensible)                                      |

## Container Registry

| Component                                  | Purpose                                                                                         |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------- |
| [nvcr.io](https://catalog.ngc.nvidia.com/) | GPU-optimized containers, pretrained models, SDKs, and Helm charts for cloud-to-edge deployment |