NVIDIA Inference Reference Architecture


Introduction

This reference architecture describes an inference-focused software stack for NVIDIA Cloud Partners. It helps operators build a cloud-native platform that can host large language model services, multimodal services, traditional machine learning inference, asynchronous GPU tasks, and partner-specific AI platforms on a shared NVIDIA accelerated infrastructure.

The design assumes the operator needs infrastructure that behaves more like a cloud service than a static cluster. Tenants should be able to request endpoints, models, Kubernetes capacity, GPU workers, model storage, and operations support without understanding every physical detail in the data center. Operators still need clear control over placement, isolation, performance, health, and lifecycle management.

The architecture uses a layered approach. It starts with data center assumptions, then shows tenant and operator views, then maps those views to infrastructure, Kubernetes, AI platform services, inference serving, model data movement, validation, telemetry, break-fix, performance, and security.

Executive Summary

NVIDIA inference deployments now need to handle large language models, multimodal models, traditional ML services, asynchronous GPU tasks, and platform APIs across many GPU nodes. This architecture defines a repeatable stack for those needs: cluster recipes and operators at the base, topology-aware scheduling and health management in the platform, model and cache movement in the data plane, model-serving orchestration above inference engines, and benchmarking plus validation around the full system.

The architecture defines component roles and integration points for a complete inference platform. Claims that require lab results are framed as validation work to run, not as completed proof.

Target Audience

  • NVIDIA Cloud Partner platform teams building inference services on GPU Kubernetes clusters.
  • Solution architects mapping NVIDIA software components to partner infrastructure.
  • Deployment engineers who need a Day 0, Day 1, and Day 2 path from cluster baseline to serving workload.
  • Performance and validation teams that need repeatable source, version, and benchmark evidence.
  • Product and field teams that need a current architecture view tied to validated components.

Customer Problem

Inference platforms are no longer a single model server behind a load balancer. Large models can require multi-node placement, separate prefill and decode pools, fast model-weight movement, cache-aware routing, GPU-aware scheduling, and operational remediation when GPU or network health changes. Partner teams need a way to assemble these capabilities without losing traceability to the component versions and design boundaries that define the stack.

Design Goals

  • Provide a repeatable full-stack blueprint for partner inference deployments.
  • Separate infrastructure, model data, serving, optimization, validation, and operations concerns.
  • Make common deployment combinations explicit so teams can adopt the full stack or a narrower subset.
  • Document compatibility, validation inputs, deployment flow, and operations boundaries.
  • Convert the reference architecture into a repeatable acceptance bar with explicit checks for cluster baseline, network, storage, scheduling, serving, telemetry, and security.
  • Make backend disposition, tenancy, routing, model-data movement, cache behavior, and release validation visible instead of leaving them as local implementation choices.
  • Require each major section to carry enough implementation detail for an operator to configure, validate, or reject a deployment choice.
  • Keep unproven performance, support, and validation claims out of the reader-facing architecture.

Architecture Decision Guardrails

The base RA should mandate evidence, ownership, and decision records, not a single provider implementation. Treat partner-selected ecosystem software, physical cluster shape, DPU placement, fabric topology, storage product, gateway implementation, and provisioning tool as architecture decisions unless an NVIDIA component contract requires them.

| Topic | Reference Requirement | Architecture Decision To Record |
| --- | --- | --- |
| Hardware profile | Every deployment needs a named hardware profile and validation evidence. | Select RTX PRO, HGX, GB300 NVL72, cloud-hosted, or lab profile before mandating fabric, DPU, cooling, rack, or tenancy assumptions. |
| Network and DPU placement | Required traffic classes and acceptance tests must be explicit. | Treat RDMA, GPUDirect RDMA, dual-plane networking, BlueField use, and separate compute fabric as profile decisions instead of universal dependencies. |
| Serving topology | Backend disposition, routing ownership, KV behavior, and rollback paths must be documented. | Decide aggregated versus disaggregated serving, prefill/decode pool shape, model-data tiering, and cache-aware routing based on measured workload behavior. |
| Kubernetes interface | The platform needs a declared Kubernetes API surface and upgrade policy. | Decide whether the Kubernetes Gateway API Inference Extension is used for InferencePool and InferenceModel interoperability, while keeping Dynamo responsible for LLM serving orchestration where selected. |
| External ecosystem tools | Do not add non-NVIDIA implementation software as reference dependencies. | Record partner-selected storage, gateway, registry, provisioning, identity, and automation tools as local implementation decisions outside the base RA. |
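The decision records the guardrails call for can stay lightweight. A minimal sketch of one in-repo record, assuming a simple schema of topic, requirement, decision, owner, and evidence (the field names here are illustrative, not a mandated format):

```python
from dataclasses import dataclass, field

@dataclass
class ArchitectureDecision:
    """One recorded decision made against a base RA requirement."""
    topic: str                # e.g. "Hardware profile"
    requirement: str          # what the reference architecture mandates
    decision: str             # what this deployment actually chose
    owner: str                # accountable team or role
    evidence: list = field(default_factory=list)  # validation artifact links

    def is_complete(self) -> bool:
        # A decision is only acceptable with an owner and at least one
        # piece of attached validation evidence.
        return bool(self.owner and self.evidence)

adr = ArchitectureDecision(
    topic="Hardware profile",
    requirement="Named hardware profile with validation evidence",
    decision="HGX profile, full-node GPU allocation",
    owner="platform-infra",
    evidence=["reports/hgx-baseline.json"],
)
```

A record that lacks an owner or evidence fails `is_complete()`, which is the point: the guardrail is evidence and ownership, not the tool used to track it.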

Data Center Architecture

This architecture assumes an NCP data center built from GPU compute PODs and a core POD. GPU PODs host accelerated tenant work. The core POD hosts shared control planes, storage services, observability backends, registries, validation services, and operator automation. The selected hardware profile should expose the practical domains needed by the inference platform: tenant access, secure management, cluster interconnect where required, and local GPU scale-up where supported.

| Domain | Purpose | Inference Architecture Guidance |
| --- | --- | --- |
| GPU Compute POD | Runs tenant GPU workloads, model-serving workers, and accelerated data-plane services. | Use rack-scale GPU domains for large models that need multi-node tensor, pipeline, or expert parallel execution. Keep placement policy aware of rack, NVLink, NIC rail, and failure-domain boundaries. |
| Core POD | Hosts control planes, registries, shared services, telemetry, validation services, and storage control planes. | Run platform services such as Kubernetes control planes, NVCF services, registries, metadata stores, observability backends, and automation systems away from tenant GPU worker nodes when possible. |
| Tenant Access Network | Provides north-south API, user, storage, registry, and shared-service access. | Expose inference APIs and platform APIs through tenant-aware ingress, gateway, rate-limit, and identity boundaries. |
| Cluster Interconnect Network | Provides east-west traffic for distributed inference, model data movement, and high-throughput service coordination. | Validate RDMA, GPUDirect RDMA, NIC rail alignment, and congestion controls before accepting multi-node LLM workloads. |
| NVLink Domain | Provides local scale-up GPU bandwidth inside the rack or node domain. | Prefer topology-aware placement for model shards, prefill pools, decode pools, and data-plane services that exchange KV cache or weights at high rate. |
| Storage Domain | Provides object, file, block, and local ephemeral capacity for models, containers, logs, cache, and benchmark artifacts. | Separate model artifact storage, cache storage, benchmark output, container image cache, and telemetry retention tiers. |

GPU Compute Node

GPU compute nodes run the endpoint workers, prefill workers, decode workers, model data-plane services, benchmark jobs, and supporting sidecars. The node design should account for GPU memory capacity, CPU memory, local NVMe capacity, NIC count, NIC speed, PCIe topology, and GPU-to-GPU topology. For large models, node placement must be considered part of the architecture because a poor placement decision can turn a valid software stack into a slow service.

The baseline node software includes a supported OS image, container runtime, NVIDIA driver, NVIDIA Container Toolkit, GPU device plugin, DCGM telemetry, and the networking stack required for RDMA or GPUDirect RDMA when those paths are part of the design.

Networking

The inference platform uses multiple networks with different responsibilities.

  • Tenant access traffic carries user APIs, control plane calls, endpoint ingress, storage access, registry access, and shared-service traffic.
  • Secure management traffic carries platform administration, out-of-band management, provisioning, and operator automation.
  • Cluster interconnect traffic carries east-west model serving traffic, distributed inference coordination, model data movement, and high-volume cache or tensor movement when the workload profile requires it.
  • NVLink traffic carries local scale-up GPU communication inside the supported GPU topology.

For disaggregated serving, network design must be reviewed together with serving topology. Prefill pools, decode pools, routers, cache services, and model-weight transfer services should be placed so the highest-volume traffic remains on the most appropriate fabric. RDMA, GPUDirect RDMA, separate east-west compute fabric, DPU offload, and dual-plane topology are decisions to validate against the selected RTX PRO, HGX, GB300 NVL72, cloud-hosted, or lab profile.

Before endpoint validation begins, the operator should validate the required network paths independently. The minimum network evidence should cover RDMA readiness where required, GPUDirect RDMA readiness where required, cross-rail behavior where present, congestion behavior, and degradation behavior when a network path fails or becomes saturated.
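The minimum network evidence can be treated as a machine-checkable gate rather than a prose checklist. A sketch, assuming a flat evidence dictionary and a per-profile waiver set (the check names are illustrative, not an NVIDIA-defined schema):

```python
# Minimum network evidence categories described above; names are illustrative.
REQUIRED_NETWORK_EVIDENCE = {
    "rdma_ready",            # where the profile requires RDMA
    "gpudirect_rdma_ready",  # where the profile requires GPUDirect RDMA
    "cross_rail_behavior",   # where multiple NIC rails are present
    "congestion_behavior",   # behavior under saturation
    "degradation_behavior",  # behavior when a path fails
}

def network_gate(evidence: dict, waived: set = frozenset()) -> tuple:
    """Return (passed, missing) for the collected network evidence.

    `waived` holds checks the selected hardware profile does not require,
    e.g. cross-rail behavior on a single-rail profile.
    """
    required = REQUIRED_NETWORK_EVIDENCE - waived
    missing = sorted(check for check in required if not evidence.get(check))
    return (not missing, missing)
```

Recording a waiver explicitly (rather than silently skipping a check) keeps the profile decision auditable, which matches the guardrail that RDMA and GPUDirect RDMA are profile decisions, not universal dependencies.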

Storage

Inference needs more than one storage tier. Model artifacts often start in object or file storage. Hot models may need node-local cache, shared cache, or GPU-to-GPU transfer paths. Logs, traces, metrics, benchmark outputs, and validation reports need retention policies that are separate from model storage. Local NVMe is useful for image cache, model cache, temporary tensors, and short-lived logs, but tenant data must be sanitized when infrastructure changes ownership.

The architecture should distinguish persistent model artifact storage from ephemeral cache and from telemetry retention. ModelExpress, NIXL, Velo, FlexTensor, and model streaming components sit at this boundary between storage, memory, and runtime workers.

Storage validation should happen before model-serving validation. The operator should record model-load throughput, cache-hit behavior, shared-file-system behavior, local-NVMe behavior, and failure behavior for storage or fabric degradation. Where local SSD is used for cache or KV offload, the design should include wear review and data-sanitization procedures.
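Model-load throughput is the simplest of these records to capture: time the load and derive bytes per second. A minimal sketch, assuming the deployment wraps whatever loader it actually uses (`load_fn` and the record fields are illustrative):

```python
import time

def record_model_load(source_label: str, size_bytes: int, load_fn) -> dict:
    """Time one model load and record throughput evidence.

    `load_fn` is the deployment's own loader; this wrapper only measures
    wall-clock load time and derives bytes-per-second throughput for the
    storage validation record.
    """
    start = time.monotonic()
    load_fn()
    elapsed = time.monotonic() - start
    return {
        "source": source_label,
        "size_bytes": size_bytes,
        "seconds": round(elapsed, 3),
        # Guard against a sub-resolution timer on a trivial load.
        "throughput_bps": size_bytes / elapsed if elapsed > 0 else None,
    }
```

The same wrapper can be run against object storage, shared file systems, and local NVMe to produce the comparable per-tier evidence the text asks for.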

Data Center View

At the data center level, the inference platform should be viewed as a set of control planes and data planes. Control planes run in core services or per-tenant platform clusters. Data planes run close to the GPU nodes. The operator should document which services run in the core POD, which run in tenant Kubernetes clusters, which run as DaemonSets, which run as endpoint-specific workers, and which require direct access to GPUs, NICs, or local storage.

Figure: NCP Inference Reference Architecture Introduction

Tenant Compute View

From a tenant point of view, the platform exposes AI services, Kubernetes resources, model endpoints, and operational status. The tenant should not need to understand the entire data center, but the platform must preserve enough topology and resource intent to run high-performance inference.

| Layer | Tenant-Facing Capability | Primary NVIDIA Components |
| --- | --- | --- |
| AI Platform | Inference endpoints, function/task lifecycle, traffic routing, model serving, endpoint telemetry, and user-facing APIs. | NVIDIA Cloud Functions, NVIDIA Dynamo, NVIDIA TensorRT-LLM, NVIDIA TensorRT |
| Managed Kubernetes | Per-tenant or per-environment cluster abstraction, GPU worker pools, admission, autoscaling, scheduling, service discovery, and workload lifecycle. | NVIDIA GPU Operator, NVIDIA Network Operator, KAI Scheduler, NVIDIA Grove |
| Infrastructure Services | Cluster recipes, GPU node enablement, network enablement, health remediation, validation, and resource lifecycle. | NVIDIA AI Cluster Runtime, NVIDIA NVSentinel, NVIDIA ISV NCP Validation Suite |
| Model Data Plane | Model artifact distribution, weight transfer, KV and tensor movement, cache locality, large payload staging, and memory-tier extension. | NVIDIA ModelExpress, NVIDIA Inference Xfer Library, NVIDIA Velo, NVIDIA FlexTensor, Run:ai Model Streamer |
| Optimization And Validation | Model tuning, runtime preparation, configuration search, benchmark execution, and acceptance evidence. | NVIDIA Model Optimizer, NVIDIA AITune, NVIDIA DALI, NVIDIA AIConfigurator, NVIDIA AIPerf |

Operator View

From an operator point of view, the stack is a set of lifecycle systems. The operator builds the cluster baseline, enables GPU and network resources, exposes tenant or platform control planes, schedules workloads, moves model data, monitors health, responds to failures, and validates changes before handing capacity to tenants.

| Operator Control Plane | Responsibility | Architecture Notes |
| --- | --- | --- |
| Cloud And Cluster Baseline | Define cloud, accelerator, OS, Kubernetes, operator, and workload-intent combinations. | Use AI Cluster Runtime recipes and snapshots as the repeatable baseline before installing the inference stack. |
| GPU Enablement | Install and manage drivers, container toolkit integration, device plugins, GPU feature discovery, and monitoring operands. | Use GPU Operator as the standard Kubernetes GPU enablement layer. |
| Network Enablement | Install and manage RDMA and GPUDirect RDMA networking components. | Use Network Operator where the deployment requires high-performance east-west traffic for inference workers or model data movement. |
| Placement And Scheduling | Allocate GPUs fairly, schedule multi-pod units together, preserve startup order, and account for topology. | Use KAI Scheduler for queue and allocation policy and Grove for gang scheduling, startup ordering, and topology-aware groups. |
| Health And Break-Fix | Detect hardware and system faults, cordon or drain affected nodes, run remediation, and validate return to service. | Use NVSentinel with platform runbooks, GPU Operator state, and validation tests to close the loop. |
| Validation And Acceptance | Run cluster, serving, performance, and operations tests before handing the platform to tenants. | Use the ISV NCP Validation Suite and AIPerf to produce acceptance evidence for the selected component combination. |

Figure: NCP Inference Reference Architecture

Solution Overview

The architecture uses a layered model. Platform APIs and workload endpoints sit above the serving layer. Serving frameworks coordinate inference engines and request flow. Optimization tools prepare models and execution paths. Model data services move weights, tensors, KV cache blocks, and large payloads. Kubernetes infrastructure components provide GPU enablement, networking, scheduling, placement, health, and repeatable recipes. Benchmarking and validation tools close the loop.

Component Layers

| Layer | Intent | Components |
| --- | --- | --- |
| API And Experience Layer | Expose inference endpoints, platform APIs, and operator workflows. | NVIDIA Cloud Functions |
| Inference Serving Layer | Coordinate model-serving engines, request routing, prefill, decode, and service scale. | NVIDIA Dynamo, NVIDIA TensorRT-LLM, NVIDIA TensorRT |
| Optimization Layer | Tune models, execution graphs, and serving configurations for NVIDIA GPUs. | NVIDIA Model Optimizer, NVIDIA AITune, NVIDIA DALI |
| Model Data And Memory Layer | Move weights, tensors, KV cache blocks, and large payloads across memory and storage tiers. | NVIDIA ModelExpress, NVIDIA Inference Xfer Library, NVIDIA Velo, NVIDIA FlexTensor, Run:ai Model Streamer |
| Cloud Orchestration Layer | Run and place inference workloads on Kubernetes with GPU, network, scheduling, and health controls. | NVIDIA AI Cluster Runtime, NVIDIA GPU Operator, NVIDIA Network Operator, KAI Scheduler, NVIDIA Grove, NVIDIA NVSentinel |
| Performance And Validation Layer | Benchmark, size, validate, and continuously refresh the architecture. | NVIDIA AIConfigurator, NVIDIA AIPerf, NVIDIA ISV NCP Validation Suite |
| Architecture Governance Layer | Capture proposals, implementation decisions, and architecture change history. | NVIDIA Dynamo Enhancement Proposals |

Common Component Combinations

| Combination | Components | When To Use |
| --- | --- | --- |
| Full Stack NCP Inference Platform | NVIDIA AI Cluster Runtime, NVIDIA GPU Operator, NVIDIA Network Operator, KAI Scheduler, NVIDIA Grove, NVIDIA Dynamo, NVIDIA ModelExpress, NVIDIA Inference Xfer Library, NVIDIA AIConfigurator, NVIDIA AIPerf, NVIDIA ISV NCP Validation Suite, NVIDIA NVSentinel | A partner wants a repeatable Kubernetes-based inference platform with cluster recipes, GPU and network enablement, disaggregated serving, model-transfer acceleration, benchmarking, validation, and operations. |
| Large LLM Disaggregated Serving | NVIDIA Dynamo, NVIDIA TensorRT-LLM, NVIDIA ModelExpress, NVIDIA Inference Xfer Library, NVIDIA Grove, NVIDIA AIConfigurator, NVIDIA AIPerf | A model spans multiple GPUs or nodes and needs independent prefill and decode scale, fast model starts, KV-aware routing, and measured token-service behavior. |
| Traditional ML And Vision Inference | NVIDIA TensorRT, NVIDIA AITune, NVIDIA Model Optimizer, NVIDIA DALI, NVIDIA GPU Operator, NVIDIA AIPerf | A workload is not primarily an LLM service and benefits from graph optimization, TensorRT runtime acceleration, GPU preprocessing, and endpoint-level performance tests. |
| Model Data Plane Acceleration | NVIDIA ModelExpress, NVIDIA Inference Xfer Library, NVIDIA Velo, NVIDIA FlexTensor, Run:ai Model Streamer | The deployment bottleneck is model load time, weight movement, tensor staging, large payload movement, or memory-tier pressure. |
| Kubernetes Infrastructure Baseline | NVIDIA AI Cluster Runtime, NVIDIA GPU Operator, NVIDIA Network Operator, KAI Scheduler, NVIDIA Grove, NVIDIA NVSentinel, NVIDIA ISV NCP Validation Suite | The platform team needs reproducible cluster configuration, GPU and network operators, GPU-aware scheduling, coordinated placement, health remediation, and acceptance checks before serving workloads. |

Visual Architecture Views

The following views show the same inference platform from different operating angles. Use them during design review to confirm that the architecture has a complete layered stack, a clean control-plane and data-plane boundary, explicit network and storage gates, multi-cluster routing semantics, and a validation loop that can reject weak service profiles before production handoff. Each view should be reviewed with the component matrix, source evidence, and release state so the diagram remains an implementation aid rather than a static illustration.

| View | Review Question | Decision It Should Drive |
| --- | --- | --- |
| Layered Inference Factory | Can the team trace each endpoint from tenant API through serving, model data movement, Kubernetes, infrastructure, and validation? | Identify missing layers before implementation, especially where model movement, cache policy, scheduling, or validation has no owner. |
| Control Plane And Data Plane Split | Are policy, planning, scheduling, health, routing, prefill, decode, KV cache, transfer, and runtime responsibilities separated? | Place control services away from hot-path GPU workers and define which data-plane services require GPU, NIC, local-NVMe, or topology access. |
| Network And Storage Readiness | Has the deployment validated tenant access, management, east-west, local GPU scale-up, storage, and local cache behavior independently? | Gate service acceptance on network and storage evidence before the first production endpoint is accepted. |
| Multi-Cluster Routing | Can routing evaluate health, capacity, locality, compliance tags, and cache state across clusters before sending traffic? | Define when requests stay local, fail over, or rebalance across clusters, and record which signals make that decision auditable. |
| Validation Gate | Does the release path prove cluster baseline, operator readiness, network, storage, serving, performance, security, and evidence completeness? | Promote only service profiles that pass validation, claim verification, and depth review with source evidence attached. |

Layered Inference Factory View

Use this view to confirm that every tenant-facing endpoint has a path through platform API, serving orchestration, model data movement, Kubernetes orchestration, accelerated infrastructure, and validation. Missing ownership in any layer should block promotion of the service profile.

Control Plane And Data Plane Split

Use this view to separate policy, planning, scheduling, and health control from the hot path that routes requests, moves weights, manages KV cache, and executes inference. The split should be explicit in namespace design, service-account policy, placement policy, telemetry, and failure handling.

Network And Storage Readiness View

Use this view to validate infrastructure readiness before the serving stack is blamed for endpoint behavior. Tenant access, secure management, east-west fabric, local GPU scale-up, model artifact storage, and node-local NVMe each need independent evidence before a disaggregated inference profile is accepted.

Multi-Cluster Routing View

Use this view when an endpoint can run in more than one cluster, region, or availability zone. The architecture should define the routing inputs, cache-locality signals, health checks, compliance tags, and fallback behavior before endpoint traffic is distributed across clusters.

Validation Gate View

Use this view as the release path for a service profile. A profile should not be published until the cluster baseline, operators, network, storage, serving path, performance targets, security posture, source evidence, claim audit, and depth audit have passed.
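The publication rule above reduces to an all-checks-pass gate. A minimal sketch, assuming each check resolves to a boolean in a results dictionary (the check names mirror the text; the structure is illustrative):

```python
# Checks a service profile must pass before publication, per the text above.
RELEASE_CHECKS = [
    "cluster_baseline", "operators", "network", "storage",
    "serving_path", "performance_targets", "security_posture",
    "source_evidence", "claim_audit", "depth_audit",
]

def can_publish(profile_results: dict) -> bool:
    """A profile is publishable only when every release check passed.

    A missing check counts as a failure: absence of evidence blocks
    promotion rather than defaulting to success.
    """
    return all(profile_results.get(check) is True for check in RELEASE_CHECKS)
```

Treating a missing result as a failure is the important design choice: the gate rejects incomplete evidence, not only explicit failures.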

Infrastructure-as-a-Service Architecture

The IaaS layer provides the consumable infrastructure underneath the inference platform. An NCP may expose bare metal, virtual machines, managed Kubernetes clusters, or an integrated AI platform. Regardless of the product shape, the IaaS layer must define how GPU nodes are provisioned, sanitized, patched, monitored, placed, and returned to service.

Cloud And Cluster Control Plane

The cloud and cluster control plane should capture tenant intent and turn it into concrete infrastructure operations. For inference, the most important intents are GPU capacity, model endpoint capacity, network capability, storage access, isolation model, service objective, and validation state.

NVIDIA AI Cluster Runtime provides a recipe-driven way to describe known-good combinations of cloud, accelerator, OS, Kubernetes, operators, and workload intent. In this architecture, AI Cluster Runtime is the baseline contract between the infrastructure layer and the inference layer.

Compute Service

The compute service manages GPU nodes, general-purpose control nodes, and any storage or utility nodes that support the platform. It should support inventory, provisioning, firmware policy, OS image policy, node readiness, tenant hand-off, and sanitization.

For inference, compute placement must account for GPU topology, NIC topology, model size, prefill and decode roles, cache locality, endpoint isolation, and failure domains. Full-node allocation is simplest for large LLM services. Smaller services may use MIG or virtualized GPU paths, but those choices must be validated against latency, throughput, and isolation requirements.
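The full-node-versus-MIG choice described above can be made explicit as a placement rule. A sketch under stated assumptions: the thresholds, the half-GPU cutoff for MIG candidacy, and the requirement that MIG be validated first are illustrative policy, not a mandated sizing formula:

```python
def allocation_mode(model_gpu_mem_gib: float, gpu_mem_gib: float,
                    needs_multi_gpu: bool, mig_validated: bool) -> str:
    """Pick a GPU allocation mode for a service (illustrative policy).

    Full-node allocation is the simple default for large LLM services.
    MIG is only offered once latency, throughput, and isolation validation
    exists for the profile, per the text above.
    """
    if needs_multi_gpu or model_gpu_mem_gib > gpu_mem_gib:
        return "full-node"        # large model: whole node, topology-aware
    if model_gpu_mem_gib <= gpu_mem_gib / 2 and mig_validated:
        return "mig"              # small model on a validated MIG profile
    return "full-gpu"             # default: one whole GPU per worker
```

Encoding the rule keeps the decision reviewable: changing a threshold is a recorded policy change rather than an ad hoc scheduling choice.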

Software Defined Network Layer

The network layer must provide tenant isolation and high-performance east-west communication. The tenant access network carries APIs and user traffic. The cluster interconnect carries distributed inference and model data movement. The secure management network carries provisioning and operator control. NVLink provides local GPU scale-up within supported domains.

NVIDIA Network Operator belongs in the Kubernetes layer when RDMA and GPUDirect RDMA components need to be managed as part of the platform. The underlying NCP network control plane still owns fabric-level routing, tenant segmentation, address management, and switch configuration.

Software Defined Storage Layer

The storage layer provides persistent model artifacts, endpoint configuration, benchmark artifacts, logs, traces, metrics, and ephemeral cache. It should expose file, object, block, and local storage options based on workload needs.

The model data-plane components in this architecture do not replace storage. They improve the path from storage to serving workers by coordinating cache state, streaming tensors, moving weights, and staging large payloads across GPU and host memory tiers.

Container-as-a-Service: Kubernetes

Kubernetes is the primary orchestration layer for cloud-native inference workloads. It provides declarative APIs, controllers, scheduling, service discovery, horizontal scaling, namespace isolation, and a consistent packaging model for model-serving services and platform services.

Kubernetes Usage And Personas In ML/AI

AI practitioners need reproducible environments, model endpoints, benchmark feedback, and access to GPU capacity. Developers need stable APIs, deployment workflows, traffic routing, and observability. Platform engineers need GPU enablement, placement controls, tenant boundaries, upgrade workflows, and break-fix automation.

The inference architecture uses Kubernetes for three jobs:

  • Hosting platform control planes and operators.
  • Hosting tenant or endpoint-specific inference workloads.
  • Hosting validation, telemetry, model data-plane, and automation services.

Kubernetes Architecture For Inference

NVIDIA GPU Operator enables GPU support on worker nodes. NVIDIA Network Operator enables RDMA and GPUDirect RDMA networking components where required. KAI Scheduler provides GPU-aware queueing and allocation policy. Grove provides gang scheduling, startup ordering, and topology-aware placement for multi-pod inference units.

These components matter because inference services are often not independent pods. A large service may need routers, prefill workers, decode workers, cache services, sidecars, and model-data services to start and scale coherently. If only part of the service schedules, GPUs can sit idle while the endpoint remains unhealthy.

The Kubernetes Gateway API Inference Extension can be included as a cloud-native interoperability layer when the platform needs Kubernetes-native InferencePool and InferenceModel resources. Treat that extension as an API integration decision, not as a required scheduler, serving backend, or replacement for the Dynamo serving control path.

Operational acceptance should validate CRD installation, operator reconcile health, scheduler events, pod-group placement, GPU allocation, network attachment, and rollback behavior. The platform team should record namespace boundaries, service accounts, node selectors, topology keys, queue policy, and upgrade order before accepting a tenant-facing cluster profile.

AI Platform-as-a-Service

The AI platform layer turns GPU infrastructure into user-facing inference services. It owns endpoint lifecycle, function or task lifecycle, model-serving APIs, routing policy, rate limits, identity hooks, user observability, and integration with model artifacts.

| Service | Architectural Role | Use In The Inference RA |
| --- | --- | --- |
| NVIDIA Cloud Functions | Platform control plane and invocation plane for long-running functions and asynchronous tasks. | Use above Kubernetes when the NCP needs managed endpoint lifecycle, request routing, artifact access, secrets, multi-cluster integration, and operator-facing APIs. |
| NVIDIA Dynamo | Distributed LLM serving orchestration above NVIDIA TensorRT-LLM and compatible backend workers. | Use for multi-GPU and multi-node LLM services, disaggregated prefill and decode, KV-aware routing, planner-driven scale, Kubernetes integration, and engine coordination. |
| NVIDIA TensorRT-LLM | LLM inference engine and runtime integration path for NVIDIA GPUs. | Use when LLM workloads need engine-level acceleration and can be integrated under Dynamo or another serving control plane. |
| NVIDIA TensorRT | General-purpose inference runtime and optimization path. | Use for non-LLM inference services, computer vision, classical ML, or custom models that fit the TensorRT runtime model. |

Backend selection is an architectural decision, not only an implementation detail. The platform should maintain a backend disposition matrix that records which model classes use TensorRT-LLM, TensorRT, Dynamo-managed workers, or another compatible backend path; which gaps are accepted; which gaps block production; and which rollback path is available when a runtime release changes token behavior, tool-call behavior, LoRA behavior, or weight-loading behavior.
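A backend disposition matrix can be as simple as a versioned table in the platform repo. A sketch, where the model classes, backend names, gaps, and rollback paths are all illustrative examples rather than a mandated catalog:

```python
# Illustrative backend disposition matrix; entries are examples only.
DISPOSITION = {
    "llm-chat": {
        "backend": "tensorrt-llm",
        "gaps": [],
        "rollback": "previous-engine-build",
    },
    "llm-multimodal": {
        "backend": "dynamo-managed-worker",
        "gaps": ["tool-call parity"],      # accepted gap, recorded explicitly
        "rollback": "aggregated-serving",
    },
    "vision-classify": {
        "backend": "tensorrt",
        "gaps": [],
        "rollback": "previous-plan-file",
    },
}

def production_ready(model_class: str, blocking_gaps: set) -> bool:
    """A model class is production-ready only if none of its recorded
    gaps appear in the set of gaps that block production."""
    entry = DISPOSITION[model_class]
    return not (set(entry["gaps"]) & blocking_gaps)
```

The value of the matrix is less the lookup than the audit trail: when a runtime release changes token or tool-call behavior, the recorded rollback path says what to fall back to.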

Multi-cluster routing should separate endpoint routing, cluster routing, model routing, and cache routing. A request should cross clusters only when policy, locality, compliance tags, health, capacity, trace context, and cache state are visible to the routing layer.
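The cross-cluster decision can be sketched as a pure function over the signals the text names. The field names (`healthy`, `capacity_free`, `compliance_ok`, `cache_warm`) are assumptions for illustration, not a defined routing API:

```python
def route_request(local: dict, remote_clusters: list) -> str:
    """Decide whether a request stays local, fails over, or is rejected.

    Inputs mirror the signals named above: health, capacity, compliance,
    and cache state. All field names are illustrative.
    """
    # Default: keep traffic local when the local cluster can serve it.
    if local["healthy"] and local["capacity_free"] > 0:
        return "local"
    # Only compliant, healthy clusters with spare capacity are candidates.
    candidates = [
        c for c in remote_clusters
        if c["healthy"] and c["capacity_free"] > 0 and c["compliance_ok"]
    ]
    if not candidates:
        return "reject"  # no compliant healthy capacity anywhere
    # Prefer warm cache state first, then spare capacity.
    best = max(candidates, key=lambda c: (c["cache_warm"], c["capacity_free"]))
    return best["name"]
```

Because the function is deterministic over explicit inputs, logging those inputs alongside the decision makes every cross-cluster hop auditable, which is the requirement the Multi-Cluster Routing view places on the design.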

Cloud-Native Inference Gateway Integration

| Capability | Use In This RA | Boundary |
| --- | --- | --- |
| Kubernetes Gateway API Inference Extension | Use as a cloud-native Kubernetes API option for InferencePool, InferenceModel, and gateway-facing model routing policy when the NCP wants standards-based interoperability. | Do not treat it as a replacement for Dynamo Router, Planner, KV Block Manager, NIXL, or backend worker orchestration. |
| KV-aware routing | Use routing inputs such as cache state, model identity, worker load, prefill/decode role, locality, and health when the workload benefits from cache reuse. | Require telemetry that shows routing decisions and cache behavior before making cache-aware routing a production default. |
| Model catalog examples | Treat Nemotron, DeepSeek, Kimi, and similar current model families as workload payloads to validate, not as platform dependencies. | Keep model revision, tokenizer, license, context length, backend disposition, and benchmark profile in a refreshable validation catalog. |

Inference Serving Flow

  1. A user or application calls a platform API or inference endpoint.
  2. The platform authenticates the call, applies policy, and routes it to the endpoint control plane.
  3. The serving layer selects runtime workers and applies routing, batching, prefill, decode, and cache policy.
  4. The model data layer supplies model artifacts, weight transfer, KV or tensor movement, and large-payload staging.
  5. Kubernetes and scheduler components maintain worker placement, readiness, and scale.
  6. Telemetry and benchmark systems compare live behavior with accepted baselines.
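The six steps above can be sketched as a single request path. Every collaborator here is a stand-in injected as a callable; none of these names correspond to a real NVIDIA API:

```python
def handle_request(request, *, authenticate, route, select_worker,
                   ensure_model, infer, compare_to_baseline):
    """Illustrative end-to-end inference request path (steps 1-6 above)."""
    caller = authenticate(request)            # step 2: authn and policy
    endpoint = route(request, caller)         # step 2: endpoint control plane
    worker = select_worker(endpoint)          # step 3: routing/batching/cache policy
    ensure_model(worker, endpoint)            # step 4: model data plane staging
    response = infer(worker, request)         # step 5: placed, ready worker executes
    compare_to_baseline(endpoint, response)   # step 6: telemetry vs accepted baseline
    return response
```

The ordering is the architectural point: model-data staging (step 4) sits between worker selection and execution, so a cold worker blocks only its own request path, not the routing layer.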

Model Optimization And Runtime Preparation

Model optimization is a separate architectural layer because it changes the artifact that enters serving. The optimization path should be selected before final benchmark acceptance, not after a service is already in production.

New model bring-up should use versioned recipes rather than one-off tuning notes. Each recipe should define the quantization path, backend path, serving topology, benchmark profile, accuracy gate, artifact provenance check, and rollback path before it is recommended for a partner service profile.
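A versioned recipe is just a frozen record of those fields plus a readiness rule. A minimal sketch, where the field names mirror the text but the schema itself is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BringUpRecipe:
    """One versioned model bring-up recipe (illustrative schema)."""
    model: str
    version: str
    quantization_path: str   # e.g. "fp8-per-tensor"
    backend_path: str        # e.g. "tensorrt-llm"
    serving_topology: str    # e.g. "disaggregated prefill/decode"
    benchmark_profile: str
    accuracy_gate: float     # minimum accepted accuracy vs reference run
    provenance_checked: bool # artifact provenance verified
    rollback_path: str       # what to revert to if the recipe regresses

    def ready_to_recommend(self) -> bool:
        # A recipe is only recommendable with provenance verified
        # and a concrete rollback path recorded.
        return self.provenance_checked and bool(self.rollback_path)
```

Freezing the dataclass matches the intent of "versioned recipes": changing a field means issuing a new recipe version, not mutating tuning notes in place.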

| Component | When It Enters The Workflow | Output |
| --- | --- | --- |
| NVIDIA Model Optimizer | Before deployment when the model needs compression, quantization, or deployment-oriented graph preparation. | A model artifact or runtime path that is better aligned to the selected NVIDIA inference backend. |
| NVIDIA AITune | Before deployment for PyTorch modules and pipelines that need backend exploration and performance profiling. | A tuned model or pipeline plus profiling evidence for the selected backend. |
| NVIDIA DALI | During workload design when preprocessing can become a CPU bottleneck. | GPU-accelerated data loading and preprocessing stages close to the inference path. |
| NVIDIA AIConfigurator | During Day 0 sizing for disaggregated serving. | Candidate prefill, decode, parallelism, backend, and GPU-count configurations to test in the lab. |
| NVIDIA AIPerf | During Day 1 acceptance and Day 2 regression testing. | Endpoint-level latency, throughput, token, concurrency, and benchmark report data. |

Model Data And Memory Architecture

Large-scale inference can bottleneck on model movement, cache locality, GPU memory pressure, and startup time. The model data plane sits between storage and serving workers and should be designed as deliberately as the serving runtime.

| Component | Data Plane Function | Design Consideration |
| --- | --- | --- |
| NVIDIA ModelExpress | Coordinates model weight acquisition, cache state, metadata, and peer-to-peer transfer. | Use when cold start, duplicate model downloads, or weight fan-out limit scale or autoscaling responsiveness. |
| NVIDIA Inference Xfer Library | Provides a low-level transfer layer for inference data movement. | Use under model, KV, or cache services that need high-throughput movement across GPU, host, and remote memory tiers. |
| NVIDIA Velo | Provides active messaging, streaming, rendezvous, discovery, queues, and observability primitives. | Use inside distributed services that need typed communication and large-payload staging across workers. |
| NVIDIA FlexTensor | Provides tensor discovery, host-resource analysis, and offload strategies. | Use when GPU memory pressure requires tensor movement or memory-tier planning. |
| Run:ai Model Streamer | Provides model tensor streaming and benchmark-oriented loading paths. | Use as an optional model-loading path when streaming behavior or benchmark evidence supports it. |

KV cache ownership, transfer, eviction, recovery, and observability must be explicit in the design. Host memory, local SSD, remote memory, and peer-to-peer transfer should be treated as planned tiers with accepted failure behavior rather than emergency overflow paths. Validation should include cache-aware routing behavior, worker restart behavior, cache-hit-rate tracking, and local-SSD wear review where SSD-backed cache or offload is used.

Model startup should be decomposed into artifact discovery, cache warmup, weight movement, container startup, backend initialization, and first-ready signaling. Record model download time, cache-hit time, peer transfer time, container ready time, backend ready time, first token time, and restart recovery time for every accepted service profile.
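The decomposition above can be made operational with a small report that requires a measured duration for every named phase before computing totals, so a regression is attributed to one phase rather than to "slow startup". This is an illustrative sketch; the phase names follow the paragraph above and are not a product API.

```python
# Phases of model startup, in order, as named in the text above.
STARTUP_PHASES = [
    "artifact_discovery",
    "cache_warmup",
    "weight_movement",
    "container_startup",
    "backend_initialization",
    "first_ready_signal",
]

def startup_report(durations_s: dict[str, float]) -> dict:
    """Summarize a startup trace: refuse partial traces, then report the
    total time-to-ready and the slowest phase for regression triage."""
    missing = [p for p in STARTUP_PHASES if p not in durations_s]
    if missing:
        raise ValueError(f"unmeasured startup phases: {missing}")
    total = sum(durations_s[p] for p in STARTUP_PHASES)
    slowest = max(STARTUP_PHASES, key=lambda p: durations_s[p])
    return {"total_s": total, "slowest_phase": slowest}
```

Recording these per-phase durations for every accepted service profile gives the baseline that later restart-recovery and autoscaling measurements are compared against.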

Data Flow Diagrams

The inference stack supports multiple data paths, not one monolithic service path. These diagrams show the major GenAI, traditional ML, and deployment flows that operators should validate.

GenAI/LLM Inference Flow

Traditional ML Inference Flow

Model Deployment Flow

Key Component Interactions

Disaggregated LLM Serving

The critical interactions are Planner with Grove for right-sized prefill and decode capacity, Router with the KV Block Manager for cache-aware placement, and KV Block Manager with NIXL for memory-tier movement.

Kubernetes Infrastructure Stack

Component Interaction Matrix

| Component | Interacts With | Integration Type |
| --- | --- | --- |
| NVIDIA Dynamo | Router, Planner, TensorRT-LLM, NIXL, ModelExpress, Kubernetes | Runtime Orchestration |
| Router | Dynamo workers, KV Block Manager, platform gateway | Request Routing |
| KV Block Manager | Dynamo workers, NIXL, model-serving backends | Memory Management |
| NIXL | KV Block Manager, ModelExpress, GPU and host memory tiers | Data Transfer |
| ModelExpress | NIXL, model cache, serving backends, metadata stores | Model Loading |
| Grove | KAI Scheduler, Dynamo deployment units, topology policy | Gang Scheduling |
| KAI Scheduler | Grove, GPU Operator, Kubernetes queues | Scheduling |
| GPU Operator | Kubernetes nodes, device plugin, DCGM telemetry | GPU Enablement |
| Network Operator | Kubernetes nodes, RDMA and GPUDirect RDMA components | Network Enablement |
| AIConfigurator | Dynamo serving configuration, model and hardware profiles | Planning |
| AIPerf | Dynamo, platform endpoints, TensorRT services | Benchmarking |
| Model Optimizer | TensorRT, TensorRT-LLM, model artifacts | Optimization |

Getting Started

Choose an adoption path based on the first problem the operator needs to solve.

Full Stack Deployment

  1. Define the cluster baseline with NVIDIA AI Cluster Runtime.
  2. Enable GPUs and networking with GPU Operator and Network Operator.
  3. Add KAI Scheduler and Grove for GPU allocation, gang scheduling, and topology-aware placement.
  4. Optimize the model with Model Optimizer, AITune, TensorRT, or TensorRT-LLM.
  5. Plan the serving topology with AIConfigurator.
  6. Deploy Dynamo, ModelExpress, NIXL, and the selected runtime backend.
  7. Validate endpoint behavior with AIPerf and the ISV NCP Validation Suite.
  8. Operate with telemetry, NVSentinel remediation, and release-state review.

Traditional ML Inference Only

  1. Optimize model artifacts with TensorRT, Model Optimizer, AITune, or DALI where relevant.
  2. Deploy the service on GPU-enabled Kubernetes.
  3. Validate latency, throughput, and preprocessing behavior with AIPerf.

GenAI/LLM Inference Only

  1. Select the serving backend and runtime path.
  2. Use AIConfigurator to narrow prefill, decode, backend, and GPU-count choices.
  3. Deploy Dynamo with TensorRT-LLM or the selected backend.
  4. Add KV, NIXL, ModelExpress, Grove, and KAI Scheduler when the service needs multi-node scale or fast model movement.

Kubernetes Integration Only

  1. Install GPU Operator and Network Operator.
  2. Add KAI Scheduler for GPU-aware allocation.
  3. Add Grove when workloads require coordinated multi-pod placement.
  4. Validate the cluster baseline before handing capacity to endpoint teams.

Example Workload: Large MoE LLM Inference

A large mixture-of-experts LLM service is the clearest example of why this architecture is disaggregated. The workload can require separate compute profiles for prefill and decode, cache-aware routing, topology-aware placement, fast model-weight movement, and benchmark-driven configuration.

Core Design Philosophy

The service should separate control plane, request routing, prefill, decode, KV cache management, model artifact movement, and operations. Each subsystem scales against a different bottleneck. Treating them as one monolithic deployment hides the bottleneck and makes capacity planning harder.

Key Architectural Components

  • Dynamo coordinates distributed serving and backend workers.
  • Planner and AIConfigurator narrow the prefill/decode and parallelism choices before broad benchmark runs.
  • Router and KV Block Manager reduce redundant prefill work and manage cache locality.
  • NIXL and ModelExpress accelerate model data and cache movement.
  • Grove and KAI Scheduler keep multi-pod serving units schedulable and topology aware.
  • AIPerf and the validation suite turn the design into acceptance evidence.

Reference Architecture

Deployment Recommendations

  1. Start with disaggregated serving for high-throughput or long-context services.
  2. Keep prefill and decode placement topology aware.
  3. Measure model load, first-ready time, time to first token, inter-token latency, and throughput together.
  4. Use cache-aware routing only with enough telemetry to prove cache locality helps the target workload.
  5. Treat performance claims as workload-specific until the partner validation report records the hardware, model, backend, traffic profile, and versions.

Logical Architecture

  1. Clients call a platform API, an invocation endpoint, or a model-serving endpoint.
  2. The API and serving layer applies routing, authentication hooks, admission policy, and endpoint-level workload controls.
  3. The serving layer selects an inference backend and coordinates request routing, scaling, prefill, decode, and runtime workers.
  4. The model data layer stages model artifacts, streams weights, exchanges large payloads, and manages memory-tier movement where the selected components support those paths.
  5. Kubernetes orchestration places the workload on GPU nodes, configures GPU and network resources, schedules related pods, and exposes health information.
  6. Benchmarking and validation tools measure latency, throughput, startup, configuration, and environment readiness.
  7. Operations tooling feeds health events, remediation workflows, and release-state changes back into the next architecture refresh.

Use this sequence to validate ownership boundaries. Record which component owns each control-plane decision, which component owns each data-plane movement, which telemetry signal proves the transition occurred, and which rollback action returns the service to the last accepted state.
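The ownership review above can be expressed as a simple gate: every transition in the logical flow must name a control-plane owner, a data-plane owner, a telemetry signal, and a rollback action, and any missing entry is a promotion blocker. The record keys below are illustrative, taken directly from the sentence above.

```python
# Hypothetical per-transition ownership record. An unresolved (missing or
# empty) field is reported as a blocker, matching the rule that ownership
# gaps block promotion from lab to tenant-facing service.
REQUIRED_OWNERSHIP = ("control_plane_owner", "data_plane_owner",
                      "telemetry_signal", "rollback_action")

def promotion_blockers(transitions: list[dict]) -> list[str]:
    blockers = []
    for t in transitions:
        for key in REQUIRED_OWNERSHIP:
            if not t.get(key):
                blockers.append(f"{t['name']}: missing {key}")
    return blockers
```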

Review the logical flow during every release and require unresolved ownership gaps to block promotion from lab to tenant-facing service.

Physical Architecture

The physical deployment starts with one or more GPU Kubernetes clusters. A production partner environment should map the logical layers to concrete availability zones, clusters, racks, GPU nodes, NICs, storage services, container registries, and operator namespaces. This architecture does not assume one fixed rack shape because the component set can run on cloud, self-managed, and lab environments.

Select the hardware profile before turning topology into requirements. RTX PRO, HGX, and GB300 NVL72 designs have different assumptions for GPU form factor, rack density, local scale-up, power, cooling, network fabric, DPU placement, and tenancy. The software RA should point to the selected Enterprise RA and then record which assumptions apply to the partner profile.

| Profile | Best Fit | Architecture Direction |
| --- | --- | --- |
| NVIDIA RTX PRO AI Factory | Agentic inference, visual computing, physical AI, simulation, data processing, and small or medium LLM services. | Use the RTX PRO Enterprise RA to decide PCIe GPU server layout, Spectrum-X networking, BlueField placement, storage, and scalable-unit growth. Do not inherit HGX or NVL72 assumptions by default. |
| NVIDIA HGX AI Factory | Large per-node GPU services and multi-node inference profiles that need HGX SXM topology. | Use the HGX Enterprise RA to decide GPU POD/Core POD layout, rack networking, ConnectX and BlueField roles, and whether a separate compute fabric is required for the inference profile. |
| NVIDIA GB300 NVL72 AI Factory | Rack-scale models that depend on NVL72 local scale-up, liquid cooling, dual-plane networking, or very large model-parallel execution. | Use the GB300 NVL72 Enterprise RA to decide rack-scale NVLink domain, power, cooling, dual-plane networking, tenant posture, and operational acceptance boundaries. |

Minimum physical concerns for an implementation review:

  • GPU node type, GPU count, memory capacity, CPU memory, local storage, and PCIe or NVLink topology.
  • East-west network type, RDMA readiness, GPUDirect RDMA readiness, and switch-domain boundaries.
  • Storage source for model artifacts, local cache capacity, shared cache options, and air-gapped behavior.
  • Kubernetes version, container runtime, GPU Operator state, Network Operator state, and scheduler configuration.
  • Placement rules for multi-node model instances, prefill pools, decode pools, control-plane services, and telemetry.

Record the physical bill of materials, firmware baseline, rack and rail topology, storage topology, and management-plane path with the architecture review. Validate that node labels, scheduler topology keys, storage classes, and network attachment definitions represent the physical design rather than hiding it from Kubernetes.

Deployment Model

Day 0 defines the baseline. Use NVIDIA AI Cluster Runtime inputs, cluster snapshots, and operator values to describe the target cloud, accelerator, operating system, Kubernetes version, network shape, and workload intent. Record selected versions, topology assumptions, acceptance criteria, and tenant boundaries before deploying.

Day 1 installs and configures the stack. Enable GPU and network operators, choose scheduling and placement controls, deploy serving and model-data components, generate serving configuration, stage models, and run smoke tests before exposing partner endpoints.

Day 2 operates the environment. Track health, remediation, model cache state, benchmark drift, version changes, and validation results. Review the architecture when component versions change, when deployment evidence changes, or when partner validation expands support boundaries.

Deployment acceptance should require rendered manifests, applied values, image sources, storage classes, network attachments, service endpoints, smoke-test output, and rollback instructions. Record the owner and evidence path for each deployment phase so a failed install can be triaged without reconstructing the environment from terminal history.
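A minimal sketch of that acceptance rule: deployment acceptance holds only when every required artifact has a recorded evidence path, and the check reports exactly which items are missing so a failed install can be triaged without reconstructing the environment. The evidence keys come from the paragraph above; the function name is an assumption.

```python
# Evidence items required for deployment acceptance, per the text above.
REQUIRED_EVIDENCE = {
    "rendered_manifests", "applied_values", "image_sources",
    "storage_classes", "network_attachments", "service_endpoints",
    "smoke_test_output", "rollback_instructions",
}

def acceptance_gaps(evidence: dict[str, str]) -> set[str]:
    """Return the evidence items still missing an artifact path."""
    return {k for k in REQUIRED_EVIDENCE if not evidence.get(k)}
```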

Validation Methodology

Platform validation requires lab execution. Start with the selected component combination, then validate the cluster baseline, operator readiness, serving deployment, model-data paths, benchmark behavior, and operations workflows with the local validation suite and benchmarking tools.

The RA should be enforced as an executable acceptance bar. A cluster or service profile should not be described as accepted until validation covers firmware consistency, GPU and network readiness, east-west reachability, north-south redundancy, storage configuration, scheduler behavior, model-serving smoke tests, and the evidence package for any partner-specific support claim.

Validation work should produce:

  • Cluster recipe or baseline evidence.
  • Operator installation and readiness evidence.
  • Model-serving deployment evidence.
  • Benchmark outputs for time to first token, inter-token latency, request latency, throughput, startup, and failure behavior where relevant.
  • Security and operations evidence for identity, secrets, observability, remediation, and upgrade paths.

For each validation run, record the model, tokenizer, backend, container image, hardware, driver, Kubernetes version, network mode, storage path, prompt profile, output profile, concurrency level, and pass or fail threshold. Require failures to identify the owning layer before closing the run.
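The closing rule in that paragraph can be enforced mechanically: a passing run may close immediately, but a failed run may not close until the owning layer is identified. This is an illustrative sketch using hypothetical record keys.

```python
def can_close_run(run: dict) -> bool:
    """A failed validation run must identify the owning layer
    (e.g. serving runtime, model data, scheduler, network) before
    it can be closed; a passing run closes unconditionally."""
    if run.get("result") == "pass":
        return True
    return bool(run.get("owning_layer"))
```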

Performance Guidance

Start with configuration search before exhaustive load testing. Use the configuration tool to narrow prefill, decode, parallelism, backend, and GPU-count choices. Then use endpoint benchmarks to measure token latency, request latency, throughput, and concurrency behavior under realistic traffic. For model-start performance, collect model download, cache-hit, weight-transfer, and first-ready timestamps. For operations, track GPU health events, recovery time, failed placements, and capacity headroom.

Do not compare configurations unless the model, tokenizer, prompt distribution, output length distribution, GPU type, driver, network, serving backend, and concurrency profile are recorded together.
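That comparison rule can be coded as a guard: two benchmark runs are comparable only when every controlled variable is recorded in both and matches. The key names below simply restate the list in the sentence above.

```python
# Controlled variables that must match before two runs may be compared.
COMPARABILITY_KEYS = (
    "model", "tokenizer", "prompt_distribution", "output_length_distribution",
    "gpu_type", "driver", "network", "serving_backend", "concurrency_profile",
)

def comparable(run_a: dict, run_b: dict) -> bool:
    """True only if every key is recorded in run_a and equal in run_b.
    A missing key disqualifies the comparison rather than passing silently."""
    return all(run_a.get(k) is not None and run_a.get(k) == run_b.get(k)
               for k in COMPARABILITY_KEYS)
```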

Speculative decoding, custom kernels, quantization, and model-specific tuning should be treated as controlled optimization inputs. Measure acceptance rate, accuracy delta, latency impact, backend compatibility, and rollback behavior before turning an optimization on by default.

Maintain a per-backend performance baseline and compare new results only against baselines that share the same model, hardware, traffic profile, and software versions.

Sizing Guidance

Use software profiles and hardware profiles together. The software profile describes the component stack. The hardware profile describes the GPU node, rack, network, DPU, storage, cooling, and tenancy assumptions that make the software profile valid.

  • Single-node inference: start with TensorRT, TensorRT-LLM, AITune, Model Optimizer, DALI, GPU Operator, and AIPerf.
  • Multi-node LLM inference: add Dynamo, ModelExpress, NIXL, Grove, AIConfigurator, and scheduler controls.
  • Partner platform inference: add NVIDIA Cloud Functions, NVIDIA AI Cluster Runtime, Network Operator, NVSentinel, and the validation suite.
  • RTX PRO profile: use for agentic AI, visual computing, physical AI, simulation, data processing, and small or medium LLM services where PCIe GPU server modularity and enterprise scalability are the starting point.
  • HGX profile: use for large GPU nodes and multi-node LLM profiles that need HGX topology and high-performance cluster networking.
  • GB300 NVL72 profile: use for rack-scale services that need NVL72 local scale-up, liquid cooling, dual-plane networking, and tightly controlled model-parallel placement.

Scale only after measuring the bottleneck. Add GPU capacity for compute saturation, add model-data acceleration for cold-start and artifact movement, add scheduling controls for placement failures, and add network controls for high east-west transfer pressure.

For each profile, document the starting GPU count, target concurrency, prompt and output shape, model size, cache policy, network requirement, storage requirement, and expected autoscaling trigger. Validate the profile with AIPerf or an equivalent endpoint benchmark before copying it to another model family.

Review sizing profiles after each benchmark run and update the accepted profile only when the measured bottleneck, mitigation, and rollback threshold are recorded.

Telemetry And Observability

The inference platform should use the same three observability pillars as the broader NCP software guide: logs, metrics, and traces. In inference, those signals must be correlated with model, endpoint, tenant, GPU, node, scheduler, and network context.

| Layer | Signals | Expected Use |
| --- | --- | --- |
| Application And Endpoint | Request count, request latency, token latency, throughput, errors, queue depth, and trace context. | Drive service-level objectives and compare benchmark behavior with live traffic. |
| Serving Runtime | Worker readiness, prefill/decode saturation, KV cache behavior, batch size, model load state, and backend errors. | Identify whether bottlenecks sit in routing, runtime workers, cache locality, or model artifact movement. |
| Kubernetes Platform | Pod state, node readiness, scheduler events, placement failures, autoscaler decisions, and operator health. | Support Day 2 incident response and capacity planning. |
| GPU And Network Fabric | GPU health, GPU utilization, memory use, NVSwitch or NIC health, RDMA status, congestion, and fabric events. | Correlate inference symptoms with infrastructure health and network pressure. |
| Validation And Analytics | Benchmark outputs, release acceptance reports, long-term trends, and regression results. | Keep hot-path operational telemetry separate from cold-path evidence used for planning and audit. |

The hot path supports real-time operations: dashboards, alerting, incident response, and service debugging. The cold path supports planning: capacity analysis, regression tracking, cost attribution, validation history, and long-term trend analysis. Keep both paths connected through stable identifiers for tenant, endpoint, model, node, GPU, and request.
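One way to keep the two paths joinable is to derive a single correlation key from the stable identifiers named above and reject any event that lacks one, so hot-path alerts and cold-path evidence always share a join key. This is a sketch; the identifier names follow the sentence above and are not a product schema.

```python
# Stable identifiers that must travel with every hot-path and cold-path record.
STABLE_IDS = ("tenant", "endpoint", "model", "node", "gpu", "request")

def correlation_key(event: dict) -> tuple:
    """Build the join key linking operational telemetry to stored evidence.
    An event missing any stable identifier is rejected at ingest."""
    missing = [k for k in STABLE_IDS if k not in event]
    if missing:
        raise KeyError(f"event missing stable identifiers: {missing}")
    return tuple(event[k] for k in STABLE_IDS)
```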

Cache-hit-rate, prefill saturation, decode saturation, model-load state, queue depth, backend version, and routing decision context should be first-class operating signals. These metrics are required to distinguish a serving bottleneck from a cache, model-data, scheduler, or network bottleneck.

Break-Fix Architecture

The break-fix system should detect, triage, remediate, validate, and return GPU infrastructure to service with minimal tenant impact. The design should separate actions that happen while a resource is in the tenant domain from actions that happen after the operator pulls the resource back into the infrastructure domain.

| Phase | Operator Action | Product Hooks |
| --- | --- | --- |
| Detect | Collect in-band and out-of-band health signals from GPU nodes, network devices, Kubernetes, and serving services. | NVSentinel, GPU Operator operands, Network Operator state, platform telemetry, and serving metrics. |
| Triage | Classify the failure as service-level, pod-level, node-level, GPU-level, network-level, or storage/model-data-level. | Kubernetes events, scheduler events, Dynamo or NVCF service state, AIPerf checks, and validation-suite probes. |
| Remediate | Cordon, drain, restart, reset, reprovision, replace, or return the node to an infrastructure operator domain. | NVSentinel remediation workflows, operator reconcile loops, and NCP runbooks. |
| Validate | Confirm node health, operator health, serving readiness, benchmark smoke tests, and tenant impact. | ISV NCP Validation Suite, AIPerf, cluster snapshots, and workload-specific tests. |
| Return To Service | Re-admit the resource only after it matches the accepted baseline. | AI Cluster Runtime recipe checks, Kubernetes readiness, and platform policy controls. |

Break-fix policy should be specific about what can happen in place and what requires tenant hand-off. A pod restart, worker replacement, or endpoint scale action may stay inside the tenant domain. GPU reset, firmware remediation, repeated XID errors, network fabric issues, or node reprovisioning may require operator-domain handling.
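The tenant-domain versus operator-domain split above can be written down as policy rather than tribal knowledge: each remediation action is classified up front, and anything unclassified falls back to triage. The action labels are illustrative names for the examples in the paragraph above.

```python
# Hypothetical break-fix policy table following the examples above.
TENANT_DOMAIN = {"pod_restart", "worker_replacement", "endpoint_scale"}
OPERATOR_DOMAIN = {"gpu_reset", "firmware_remediation", "repeated_xid_errors",
                   "network_fabric_issue", "node_reprovision"}

def handling_domain(action: str) -> str:
    """Decide whether a remediation stays in place or requires hand-off."""
    if action in TENANT_DOMAIN:
        return "tenant"
    if action in OPERATOR_DOMAIN:
        return "operator"
    return "triage"  # unclassified actions must be triaged before execution
```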

Node Level Health Checks

Node health checks should run at the level where the signal is visible. GPU driver and device checks run on the host, VM, or container that has the device. Kubernetes health checks run through the cluster. Fabric health checks run through the network control plane. Endpoint checks run through serving APIs. The break-fix control plane should correlate these signals instead of treating them as separate incidents.

Performance Requirements

Inference performance depends on native access to GPUs, low-latency network paths, model artifact availability, scheduling locality, and runtime configuration. The operator should validate performance for each accepted service profile rather than assuming one benchmark generalizes to every model.

Virtual Machine And Container Networking

High-performance inference workers should avoid unnecessary network abstraction on latency-sensitive or high-bandwidth paths. Where the design requires direct NIC access, use SR-IOV, RDMA, GPUDirect RDMA, or equivalent platform mechanisms. Standard CNI networking may still be appropriate for control traffic, user APIs, and lower-volume service calls.

GPU Exposure

| Use Case | Method | Inference RA Guidance |
| --- | --- | --- |
| Exclusive GPU To Container | NVIDIA Container Toolkit and Kubernetes GPU device allocation. | Default path for many Kubernetes inference workers where the tenant or platform owns the full GPU. |
| Exclusive GPU To VM | PCIe passthrough or equivalent virtualization path. | Use when tenant isolation, custom OS control, or VM-based platform delivery is required. |
| Partitioned GPU | MIG where supported by the GPU and workload. | Use for smaller models or services that fit a partition. Validate isolation, latency, and memory headroom. |
| Time-Sliced GPU | Scheduler or vGPU-mediated sharing. | Use cautiously for non-critical or low-utilization services. Avoid for latency-sensitive production LLM serving unless validation proves it. |

Model And Storage Performance

Model load and cache behavior can dominate service readiness. Measure download time, cache-hit time, disk-to-GPU time, peer-to-peer transfer time, first-ready time, and first-token behavior. Use ModelExpress, NIXL, model streaming, and local cache only where they map to an observed bottleneck.

Serving Performance

Serving tests should record time to first token, inter-token latency, request latency, output throughput, concurrency, error rate, model size, prompt distribution, output distribution, backend, GPU type, and software versions. Use AIConfigurator to reduce the configuration search space and AIPerf to measure endpoint behavior.

Performance acceptance should require a reproducible command or workload definition, captured environment metadata, stored benchmark output, and a pass or fail threshold. Reject comparisons that omit backend version, model artifact revision, network mode, or cache state.

Isolation And Security

Isolation and security must be designed across the infrastructure, Kubernetes, AI platform, model artifact, telemetry, and user-access layers. The goal is to protect tenants from each other, protect platform services from tenant workloads, and keep operator actions auditable.

Security And Compliance

Security review should cover identity boundaries between platform APIs, model-serving services, operators, and cluster agents. Secrets must be scoped by namespace and service account. Model artifact sources should have access controls, provenance checks, and air-gap behavior when needed. Network policy should separate API ingress, control plane, metadata stores, model-data movement, and telemetry. Release reviews should record component revisions, package metadata, container image origins, SBOM availability, and validation outputs.

Workload Isolation

An NCP inference platform should define tenant boundaries at the infrastructure, Kubernetes, platform API, model artifact, telemetry, and network layers. Bare metal gives the strongest node-level isolation. Virtual machines provide a strong cloud abstraction and can improve tenant lifecycle handling. Kubernetes namespaces are useful but should not be the only isolation boundary for high-value multi-tenant GPU services.

For managed Kubernetes and AI platform services, per-tenant control planes or strongly isolated control-plane partitions reduce cross-tenant blast radius. Shared services such as registries, identity services, metadata services, and observability backends must enforce tenant-aware access.

Tenancy should be visible to the platform API, scheduler, router, model-data services, tracing, and telemetry. Validate tenant-aware routing, service-account scope, secret access, model artifact access, trace partitioning, quota behavior, and noisy-neighbor behavior under load.

Boot And Attestation

Operators should maintain trust in the firmware and software running on GPU nodes, DPUs, BMCs, and control-plane systems. Secure boot establishes a chain of signed software. Measured boot records what was loaded. Remote attestation lets a verifier compare measurements with accepted values before a node is handed to tenant workloads.

Shared Responsibility Model

The operator owns infrastructure, physical security, node lifecycle, GPU and network enablement, platform control planes, and tenant isolation. The tenant owns its users, model artifacts, endpoint policy, application configuration, and workload-level security. End users own application credentials, prompt and data handling, and any application logic they deploy on top of the service.

Operations And Lifecycle Management

Run the stack as a versioned platform. Track operator versions, CRDs, inference runtimes, model-data services, scheduler configuration, and validation tooling in the same release review. Use health monitoring and remediation workflows for GPU and NVSwitch faults. Keep benchmark baselines for each accepted configuration, and rerun them after submodule updates, driver changes, Kubernetes upgrades, or network changes.

Release reviews should include top model configurations for each accepted backend, known backend gaps, runtime-specific regressions, and the rollback decision for each service profile. The operations process should treat unsupported feature gaps as explicit disposition items rather than rediscovering them during incidents.

Compatibility Matrix

| Component | Layer | Version Or Commit | Source |
| --- | --- | --- | --- |
| NVIDIA Cloud Functions | API And Experience Layer | faf9bb45 | github.com/NVIDIA/nvcf |
| NVIDIA Dynamo | Inference Serving Layer | ai-dynamo 1.2.0 | github.com/ai-dynamo/dynamo |
| NVIDIA TensorRT-LLM | Inference Serving Layer | b9e1945a26 | github.com/NVIDIA/TensorRT-LLM |
| NVIDIA TensorRT | Inference Serving Layer | 5302b28 | github.com/NVIDIA/TensorRT |
| NVIDIA Model Optimizer | Optimization Layer | 229ba61 | github.com/NVIDIA/Model-Optimizer |
| NVIDIA AITune | Optimization Layer | v0.3.0 | github.com/ai-dynamo/aitune |
| NVIDIA AIConfigurator | Performance And Validation Layer | aiconfigurator 0.9.0 | github.com/ai-dynamo/aiconfigurator |
| NVIDIA AIPerf | Performance And Validation Layer | aiperf 0.8.0 | github.com/ai-dynamo/aiperf |
| NVIDIA ISV NCP Validation Suite | Performance And Validation Layer | isv-ncp-validation-suite 0.6.8 | github.com/NVIDIA/ISV-NCP-Validation-Suite |
| NVIDIA ModelExpress | Model Data And Memory Layer | modelexpress 0.3.0 | github.com/ai-dynamo/modelexpress |
| NVIDIA Inference Xfer Library | Model Data And Memory Layer | nixl-cu12 1.1.0 | github.com/ai-dynamo/nixl |
| NVIDIA Velo | Model Data And Memory Layer | v0.4.1 | github.com/ai-dynamo/velo |
| NVIDIA FlexTensor | Model Data And Memory Layer | v0.2.0 | github.com/ai-dynamo/flextensor |
| Run:ai Model Streamer | Model Data And Memory Layer | 0.15.9 | github.com/run-ai/runai-model-streamer |
| NVIDIA DALI | Optimization Layer | 3421a88d1 | github.com/NVIDIA/DALI |
| NVIDIA AI Cluster Runtime | Cloud Orchestration Layer | f548633b | github.com/NVIDIA/aicr |
| NVIDIA GPU Operator | Cloud Orchestration Layer | gpu-operator v1.0.0-devel | github.com/NVIDIA/gpu-operator |
| NVIDIA Network Operator | Cloud Orchestration Layer | 65176c7 | github.com/Mellanox/network-operator |
| KAI Scheduler | Cloud Orchestration Layer | e755839 | github.com/kai-scheduler/KAI-Scheduler |
| NVIDIA Grove | Cloud Orchestration Layer | 09463f2 | github.com/ai-dynamo/grove |
| NVIDIA NVSentinel | Cloud Orchestration Layer | e774181f | github.com/NVIDIA/NVSentinel |
| NVIDIA Dynamo Enhancement Proposals | Architecture Governance Layer | dfdfb78 | github.com/ai-dynamo/enhancements |

Design Alternatives And Tradeoffs

  • Full stack versus subset: the full stack gives platform coverage, but a narrower subset is appropriate for single-node or traditional ML services.
  • Aggregated serving versus disaggregated serving: aggregated serving is simpler, while disaggregated serving gives separate control over prefill, decode, and model-data movement.
  • Static placement versus topology-aware scheduling: static placement is easier to reason about, while topology-aware scheduling is better for multi-node and tightly coupled inference units.
  • Shared storage versus peer-to-peer transfer: shared storage is familiar, while peer-to-peer transfer can reduce duplicate downloads and cold-start pressure when the environment supports it.
  • Online benchmarking versus offline configuration search: offline search reduces the test space, but final acceptance still requires measured workload behavior.
  • RTX PRO versus HGX versus GB300 NVL72: each hardware profile has different GPU, rack, network, DPU, power, cooling, and tenancy assumptions, so choose the profile before writing mandatory topology requirements.
  • Direct platform API versus Kubernetes Gateway API Inference Extension: direct platform APIs can keep endpoint behavior fully inside the NCP control plane, while the Gateway API option provides Kubernetes-native InferencePool and InferenceModel integration for teams that need that API surface.
  • Required fabric versus validated fabric decision: large disaggregated services may require high-performance east-west paths, while single-node or pure endpoint profiles may not need the same compute fabric.
  • DPU-enabled design versus host-only design: DPU offload can improve isolation and infrastructure control for selected profiles, but it should be tied to the chosen Enterprise RA and validation evidence.
  • Base RA versus partner implementation: the base RA should name NVIDIA components and cloud-native API standards, while partner-selected ecosystem tools should be recorded as implementation decisions.

Known Limitations

  • This reference architecture is not a substitute for partner lab validation.
  • The compatibility matrix identifies a starting software set, not a universal support statement.
  • Physical topology, performance numbers, and support boundaries must be supplied by the partner validation process.
  • Benchmark results must be tied to the model, hardware, backend, traffic profile, and software versions used in the test.
  • Hardware guidance must be reconciled with the selected NVIDIA Enterprise RA for RTX PRO, HGX, GB300 NVL72, or another approved profile.
  • The Kubernetes Gateway API Inference Extension provides an interoperability path; it does not replace Dynamo serving design decisions, backend disposition, or KV-aware routing validation.
  • Non-NVIDIA implementation software should not become a base RA dependency unless the partner records it as a local decision outside the reference stack.

Next Steps

  1. Select the component combination that matches the target inference service.
  2. Select the hardware profile and confirm which RTX PRO, HGX, GB300 NVL72, cloud-hosted, or lab assumptions apply.
  3. Confirm the cluster baseline, GPU capacity, network fabric, storage, and model artifact sources.
  4. Deploy the selected stack in a lab environment.
  5. Run validation for serving behavior, performance, security controls, and Day 2 operations.
  6. Publish partner-specific constraints, benchmark results, and support boundaries with the final architecture.