NVIDIA Inference Reference Architecture
Introduction
This reference architecture describes an inference-focused software stack for NVIDIA Cloud Partners. It helps operators build a cloud-native platform that can host large language model services, multimodal services, traditional machine learning inference, asynchronous GPU tasks, and partner-specific AI platforms on shared NVIDIA-accelerated infrastructure.
The design assumes the operator needs infrastructure that behaves more like a cloud service than a static cluster. Tenants should be able to request endpoints, models, Kubernetes capacity, GPU workers, model storage, and operations support without understanding every physical detail in the data center. Operators still need clear control over placement, isolation, performance, health, and lifecycle management.
The architecture uses a layered approach. It starts with data center assumptions, presents the tenant and operator views, and then maps those views to infrastructure, Kubernetes, AI platform services, inference serving, model data movement, validation, telemetry, break-fix, performance, and security.
Executive Summary
NVIDIA inference deployments now need to handle large language models, multimodal models, traditional ML services, asynchronous GPU tasks, and platform APIs across many GPU nodes. This architecture defines a repeatable stack for those needs: cluster recipes and operators at the base, topology-aware scheduling and health management in the platform, model and cache movement in the data plane, model-serving orchestration above inference engines, and benchmarking plus validation around the full system.
The architecture defines component roles and integration points for a complete inference platform. Claims that require lab results are framed as validation work to run, not as completed proof.
Target Audience
- NVIDIA Cloud Partner platform teams building inference services on GPU Kubernetes clusters.
- Solution architects mapping NVIDIA software components to partner infrastructure.
- Deployment engineers who need a Day 0, Day 1, and Day 2 path from cluster baseline to serving workload.
- Performance and validation teams that need repeatable source, version, and benchmark evidence.
- Product and field teams that need a current architecture view tied to validated components.
Customer Problem
Inference platforms are no longer a single model server behind a load balancer. Large models can require multi-node placement, separate prefill and decode pools, fast model-weight movement, cache-aware routing, GPU-aware scheduling, and operational remediation when GPU or network health changes. Partner teams need a way to assemble these capabilities without losing traceability to the component versions and design boundaries that define the stack.
Design Goals
- Provide a repeatable full-stack blueprint for partner inference deployments.
- Separate infrastructure, model data, serving, optimization, validation, and operations concerns.
- Make common deployment combinations explicit so teams can adopt the full stack or a narrower subset.
- Document compatibility, validation inputs, deployment flow, and operations boundaries.
- Convert the reference architecture into a repeatable acceptance bar with explicit checks for cluster baseline, network, storage, scheduling, serving, telemetry, and security.
- Make backend disposition, tenancy, routing, model-data movement, cache behavior, and release validation visible instead of leaving them as local implementation choices.
- Require each major section to carry enough implementation detail for an operator to configure, validate, or reject a deployment choice.
- Keep unproven performance, support, and validation claims out of the reader-facing architecture.
Architecture Decision Guardrails
The base RA should mandate evidence, ownership, and decision records, not a single provider implementation. Treat partner-selected ecosystem software, physical cluster shape, DPU placement, fabric topology, storage product, gateway implementation, and provisioning tool as architecture decisions unless an NVIDIA component contract requires them.
Data Center Architecture
This architecture assumes an NCP data center built from GPU compute PODs and a core POD. GPU PODs host accelerated tenant work. The core POD hosts shared control planes, storage services, observability backends, registries, validation services, and operator automation. The selected hardware profile should expose the practical domains needed by the inference platform: tenant access, secure management, cluster interconnect where required, and local GPU scale-up where supported.
GPU Compute Node
GPU compute nodes run the endpoint workers, prefill workers, decode workers, model data-plane services, benchmark jobs, and supporting sidecars. The node design should account for GPU memory capacity, CPU memory, local NVMe capacity, NIC count, NIC speed, PCIe topology, and GPU-to-GPU topology. For large models, node placement must be considered part of the architecture because a poor placement decision can turn a valid software stack into a slow service.
The baseline node software includes a supported OS image, container runtime, NVIDIA driver, NVIDIA Container Toolkit, GPU device plugin, DCGM telemetry, and the networking stack required for RDMA or GPUDirect RDMA when those paths are part of the design.
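This baseline can be turned into a repeatable check. A minimal sketch, assuming the official kubernetes Python client and a working kubeconfig; the expected OS and runtime values are placeholders for the selected profile, and a nonzero nvidia.com/gpu allocatable count stands in for device-plugin health.

```python
# Sketch: confirm the GPU node software baseline from the Kubernetes API.
# Assumes the official `kubernetes` Python client and a working kubeconfig;
# the expected values are placeholders for the selected node profile.
from kubernetes import client, config

EXPECTED_OS_PREFIX = "Ubuntu 22.04"        # placeholder baseline value
EXPECTED_RUNTIME_PREFIX = "containerd://"  # placeholder baseline value

def check_gpu_node_baseline() -> list[str]:
    config.load_kube_config()
    failures = []
    for node in client.CoreV1Api().list_node().items:
        name = node.metadata.name
        info = node.status.node_info
        gpus = int(node.status.allocatable.get("nvidia.com/gpu", "0"))
        if gpus == 0:
            continue  # not a GPU worker; skip baseline checks
        if not info.os_image.startswith(EXPECTED_OS_PREFIX):
            failures.append(f"{name}: unexpected OS image {info.os_image}")
        if not info.container_runtime_version.startswith(EXPECTED_RUNTIME_PREFIX):
            failures.append(f"{name}: unexpected runtime {info.container_runtime_version}")
    return failures

if __name__ == "__main__":
    for failure in check_gpu_node_baseline():
        print("FAIL", failure)
```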
Networking
The inference platform uses multiple networks with different responsibilities.
- Tenant access traffic carries user APIs, control plane calls, endpoint ingress, storage access, registry access, and shared-service traffic.
- Secure management traffic carries platform administration, out-of-band management, provisioning, and operator automation.
- Cluster interconnect traffic carries east-west model serving traffic, distributed inference coordination, model data movement, and high-volume cache or tensor movement when the workload profile requires it.
- NVLink traffic carries local scale-up GPU communication inside the supported GPU topology.
For disaggregated serving, network design must be reviewed together with serving topology. Prefill pools, decode pools, routers, cache services, and model-weight transfer services should be placed so the highest-volume traffic remains on the most appropriate fabric. RDMA, GPUDirect RDMA, separate east-west compute fabric, DPU offload, and dual-plane topology are decisions to validate against the selected RTX PRO, HGX, GB300 NVL72, cloud-hosted, or lab profile.
Before endpoint validation begins, the operator should validate the required network paths independently. The minimum network evidence should cover RDMA readiness where required, GPUDirect RDMA readiness where required, cross-rail behavior where present, congestion behavior, and degradation behavior when a network path fails or becomes saturated.
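A minimal sketch of independent path validation, assuming the perftest suite (ib_write_bw) is installed on both ends and a server instance is already listening; device names and flags vary by build, and GPUDirect RDMA variants need their own validated invocation.

```python
# Sketch: exercise an RDMA path between two hosts with the perftest suite.
# Assumes `ib_write_bw` is installed on both ends and that the device and
# flag names match the local perftest build; GPUDirect RDMA variants
# (CUDA-enabled builds) need their own validated invocation.
import subprocess

def run_rdma_bandwidth_check(server_host: str, device: str = "mlx5_0") -> str:
    """Run the client side of ib_write_bw against an already-running server."""
    result = subprocess.run(
        ["ib_write_bw", "-d", device, "--report_gbits", server_host],
        capture_output=True, text=True, timeout=120,
    )
    if result.returncode != 0:
        raise RuntimeError(f"RDMA check failed: {result.stderr.strip()}")
    return result.stdout  # parse the bandwidth line against the acceptance bar
```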
Storage
Inference needs more than one storage tier. Model artifacts often start in object or file storage. Hot models may need node-local cache, shared cache, or GPU-to-GPU transfer paths. Logs, traces, metrics, benchmark outputs, and validation reports need retention policies that are separate from model storage. Local NVMe is useful for image cache, model cache, temporary tensors, and short-lived logs, but tenant data must be sanitized when infrastructure changes ownership.
The architecture should distinguish persistent model artifact storage from ephemeral cache and from telemetry retention. ModelExpress, NIXL, Velo, FlexTensor, and model streaming components sit at this boundary between storage, memory, and runtime workers.
Storage validation should happen before model-serving validation. The operator should record model-load throughput, cache-hit behavior, shared-file-system behavior, local-NVMe behavior, and failure behavior for storage or fabric degradation. Where local SSD is used for cache or KV offload, the design should include wear review and data-sanitization procedures.
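A minimal sketch for the model-load throughput measurement, using only the standard library; the artifact path and chunk size are placeholders for the storage mount and node-local NVMe layout under test.

```python
# Sketch: measure model-load throughput from a storage mount into host memory.
# Standard library only; the path and chunk size are placeholders to adapt
# to the artifact store and node-local NVMe layout under test.
import time
from pathlib import Path

def measure_read_throughput(artifact: Path, chunk_mb: int = 64) -> float:
    """Return read throughput in GiB/s for one model artifact."""
    chunk = chunk_mb * 1024 * 1024
    total = 0
    start = time.monotonic()
    with artifact.open("rb") as f:
        while data := f.read(chunk):
            total += len(data)
    elapsed = time.monotonic() - start
    return (total / 2**30) / elapsed

# Example: compare shared-filesystem and local-NVMe paths for the same weights.
# print(measure_read_throughput(Path("/models/llm/weights-00001.safetensors")))
```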
Data Center View
At the data center level, the inference platform should be viewed as a set of control planes and data planes. Control planes run in core services or per-tenant platform clusters. Data planes run close to the GPU nodes. The operator should document which services run in the core POD, which run in tenant Kubernetes clusters, which run as DaemonSets, which run as endpoint-specific workers, and which require direct access to GPUs, NICs, or local storage.
Tenant Compute View
From a tenant point of view, the platform exposes AI services, Kubernetes resources, model endpoints, and operational status. The tenant should not need to understand the entire data center, but the platform must preserve enough topology and resource intent to run high-performance inference.
Operator View
From an operator point of view, the stack is a set of lifecycle systems. The operator builds the cluster baseline, enables GPU and network resources, exposes tenant or platform control planes, schedules workloads, moves model data, monitors health, responds to failures, and validates changes before handing capacity to tenants.
Solution Overview
The architecture uses a layered model. Platform APIs and workload endpoints sit above the serving layer. Serving frameworks coordinate inference engines and request flow. Optimization tools prepare models and execution paths. Model data services move weights, tensors, KV cache blocks, and large payloads. Kubernetes infrastructure components provide GPU enablement, networking, scheduling, placement, health, and repeatable recipes. Benchmarking and validation tools close the loop.
Component Layers
Common Component Combinations
Visual Architecture Views
The following views show the same inference platform from different operating angles. Use them during design review to confirm that the architecture has a complete layered stack, a clean control-plane and data-plane boundary, explicit network and storage gates, multi-cluster routing semantics, and a validation loop that can reject weak service profiles before production handoff. Each view should be reviewed with the component matrix, source evidence, and release state so the diagram remains an implementation aid rather than a static illustration.
Layered Inference Factory View
Use this view to confirm that every tenant-facing endpoint has a path through platform API, serving orchestration, model data movement, Kubernetes orchestration, accelerated infrastructure, and validation. Missing ownership in any layer should block promotion of the service profile.
Control Plane And Data Plane Split
Use this view to separate policy, planning, scheduling, and health control from the hot path that routes requests, moves weights, manages KV cache, and executes inference. The split should be explicit in namespace design, service-account policy, placement policy, telemetry, and failure handling.
Network And Storage Readiness View
Use this view to validate infrastructure readiness before the serving stack is blamed for endpoint behavior. Tenant access, secure management, east-west fabric, local GPU scale-up, model artifact storage, and node-local NVMe each need independent evidence before a disaggregated inference profile is accepted.
Multi-Cluster Routing View
Use this view when an endpoint can run in more than one cluster, region, or availability zone. The architecture should define the routing inputs, cache-locality signals, health checks, compliance tags, and fallback behavior before endpoint traffic is distributed across clusters.
Validation Gate View
Use this view as the release path for a service profile. A profile should not be published until the cluster baseline, operators, network, storage, serving path, performance targets, security posture, source evidence, claim audit, and depth audit have passed.
Infrastructure-as-a-Service Architecture
The IaaS layer provides the consumable infrastructure underneath the inference platform. An NCP may expose bare metal, virtual machines, managed Kubernetes clusters, or an integrated AI platform. Regardless of the product shape, the IaaS layer must define how GPU nodes are provisioned, sanitized, patched, monitored, placed, and returned to service.
Cloud And Cluster Control Plane
The cloud and cluster control plane should capture tenant intent and turn it into concrete infrastructure operations. For inference, the most important intents are GPU capacity, model endpoint capacity, network capability, storage access, isolation model, service objective, and validation state.
NVIDIA AI Cluster Runtime provides a recipe-driven way to describe known-good combinations of cloud, accelerator, OS, Kubernetes, operators, and workload intent. In this architecture, AI Cluster Runtime is the baseline contract between the infrastructure layer and the inference layer.
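The exact recipe schema belongs to the AI Cluster Runtime release in use. The sketch below is a hypothetical shape, not the actual schema; it only illustrates the fields this architecture treats as the baseline contract between the infrastructure layer and the inference layer.

```python
# Hypothetical recipe shape (not the actual AI Cluster Runtime schema):
# it records the fields this architecture treats as the baseline contract.
cluster_recipe = {
    "cloud": "partner-dc-1",            # target cloud or data center profile
    "accelerator": "HGX-H100-8GPU",     # selected hardware profile
    "os_image": "ubuntu-22.04-nvidia",  # OS baseline
    "kubernetes": "1.29",               # Kubernetes version under test
    "operators": {
        "gpu-operator": "v24.x",
        "network-operator": "v24.x",
    },
    "workload_intent": "disaggregated-llm-serving",
    "acceptance": ["node-baseline", "rdma-path", "storage-throughput"],
}
```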
Compute Service
The compute service manages GPU nodes, general-purpose control nodes, and any storage or utility nodes that support the platform. It should support inventory, provisioning, firmware policy, OS image policy, node readiness, tenant hand-off, and sanitization.
For inference, compute placement must account for GPU topology, NIC topology, model size, prefill and decode roles, cache locality, endpoint isolation, and failure domains. Full-node allocation is simplest for large LLM services. Smaller services may use MIG or virtualized GPU paths, but those choices must be validated against latency, throughput, and isolation requirements.
Software Defined Network Layer
The network layer must provide tenant isolation and high-performance east-west communication. The tenant access network carries APIs and user traffic. The cluster interconnect carries distributed inference and model data movement. The secure management network carries provisioning and operator control. NVLink provides local GPU scale-up within supported domains.
NVIDIA Network Operator belongs in the Kubernetes layer when RDMA and GPUDirect RDMA components need to be managed as part of the platform. The underlying NCP network control plane still owns fabric-level routing, tenant segmentation, address management, and switch configuration.
Software Defined Storage Layer
The storage layer provides persistent model artifacts, endpoint configuration, benchmark artifacts, logs, traces, metrics, and ephemeral cache. It should expose file, object, block, and local storage options based on workload needs.
The model data-plane components in this architecture do not replace storage. They improve the path from storage to serving workers by coordinating cache state, streaming tensors, moving weights, and staging large payloads across GPU and host memory tiers.
Container-as-a-Service: Kubernetes
Kubernetes is the primary orchestration layer for cloud-native inference workloads. It provides declarative APIs, controllers, scheduling, service discovery, horizontal scaling, namespace isolation, and a consistent packaging model for model-serving services and platform services.
Kubernetes Usage And Personas In ML/AI
AI practitioners need reproducible environments, model endpoints, benchmark feedback, and access to GPU capacity. Developers need stable APIs, deployment workflows, traffic routing, and observability. Platform engineers need GPU enablement, placement controls, tenant boundaries, upgrade workflows, and break-fix automation.
The inference architecture uses Kubernetes for three jobs:
- Hosting platform control planes and operators.
- Hosting tenant or endpoint-specific inference workloads.
- Hosting validation, telemetry, model data-plane, and automation services.
Kubernetes Architecture For Inference
NVIDIA GPU Operator enables GPU support on worker nodes. NVIDIA Network Operator enables RDMA and GPUDirect RDMA networking components where required. KAI Scheduler provides GPU-aware queueing and allocation policy. Grove provides gang scheduling, startup ordering, and topology-aware placement for multi-pod inference units.
These components matter because inference services are often not independent pods. A large service may need routers, prefill workers, decode workers, cache services, sidecars, and model-data services to start and scale coherently. If only part of the service schedules, GPUs can sit idle while the endpoint remains unhealthy.
The Kubernetes Gateway API Inference Extension can be included as a cloud-native interoperability layer when the platform needs Kubernetes-native InferencePool and InferenceModel resources. Treat that extension as an API integration decision, not as a required scheduler, serving backend, or replacement for the Dynamo serving control path.
Operational acceptance should validate CRD installation, operator reconcile health, scheduler events, pod-group placement, GPU allocation, network attachment, and rollback behavior. The platform team should record namespace boundaries, service accounts, node selectors, topology keys, queue policy, and upgrade order before accepting a tenant-facing cluster profile.
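A minimal acceptance sketch for the CRD and operator checks, assuming the kubernetes Python client; the CRD name shown is the GPU Operator ClusterPolicy CRD, and the deployment names are placeholders for the installed component versions.

```python
# Sketch: verify required CRDs exist and operator deployments are ready.
# Assumes the kubernetes Python client; namespace and deployment names are
# placeholders to replace with the installed component versions.
from kubernetes import client, config

REQUIRED_CRDS = {"clusterpolicies.nvidia.com"}             # GPU Operator CRD
OPERATOR_DEPLOYMENTS = [("gpu-operator", "gpu-operator")]  # (namespace, name)

def check_operator_readiness() -> list[str]:
    config.load_kube_config()
    problems = []
    installed = {
        crd.metadata.name
        for crd in client.ApiextensionsV1Api().list_custom_resource_definition().items
    }
    for crd in REQUIRED_CRDS - installed:
        problems.append(f"missing CRD: {crd}")
    apps = client.AppsV1Api()
    for namespace, name in OPERATOR_DEPLOYMENTS:
        dep = apps.read_namespaced_deployment(name, namespace)
        if (dep.status.ready_replicas or 0) < dep.spec.replicas:
            problems.append(f"{namespace}/{name}: not all replicas ready")
    return problems
```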
AI Platform-as-a-Service
The AI platform layer turns GPU infrastructure into user-facing inference services. It owns endpoint lifecycle, function or task lifecycle, model-serving APIs, routing policy, rate limits, identity hooks, user observability, and integration with model artifacts.
Backend selection is an architectural decision, not only an implementation detail. The platform should maintain a backend disposition matrix that records which model classes use TensorRT-LLM, TensorRT, Dynamo-managed workers, or another compatible backend path; which gaps are accepted; which gaps block production; and which rollback path is available when a runtime release changes token behavior, tool-call behavior, LoRA behavior, or weight-loading behavior.
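A minimal sketch of a disposition matrix kept as reviewable data; the field names and entries are illustrative, and the gate simply blocks promotion while blocking gaps remain.

```python
# Sketch: a backend disposition matrix as reviewable data rather than tribal
# knowledge. Field names and entries are illustrative; the point is that every
# model class records its backend, accepted gaps, blockers, and rollback path.
DISPOSITION = [
    {
        "model_class": "llama-70b-chat",
        "backend": "TensorRT-LLM via Dynamo workers",
        "accepted_gaps": ["no speculative decoding in this release"],
        "blocking_gaps": [],
        "rollback": "previous TensorRT-LLM image + last accepted engine",
    },
    {
        "model_class": "vision-classifier",
        "backend": "TensorRT",
        "accepted_gaps": [],
        "blocking_gaps": ["INT8 calibration not yet validated"],
        "rollback": "FP16 engine from prior release",
    },
]

def production_ready(entry: dict) -> bool:
    # A model class with blocking gaps must not be promoted to production.
    return not entry["blocking_gaps"]
```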
Multi-cluster routing should separate endpoint routing, cluster routing, model routing, and cache routing. A request should cross clusters only when policy, locality, compliance tags, health, capacity, trace context, and cache state are visible to the routing layer.
Cloud-Native Inference Gateway Integration
Inference Serving Flow
- A user or application calls a platform API or inference endpoint.
- The platform authenticates the call, applies policy, and routes it to the endpoint control plane.
- The serving layer selects runtime workers and applies routing, batching, prefill, decode, and cache policy.
- The model data layer supplies model artifacts, weight transfer, KV or tensor movement, and large-payload staging.
- Kubernetes and scheduler components maintain worker placement, readiness, and scale.
- Telemetry and benchmark systems compare live behavior with accepted baselines (a client-side sketch of this flow follows the list).
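A client-side sketch of this flow against an OpenAI-compatible streaming endpoint, measuring time to first token and total request time; the URL, model name, and bearer token are placeholders for the deployed endpoint, and error handling is reduced for brevity.

```python
# Sketch: walk the serving flow from the client side against an
# OpenAI-compatible streaming endpoint, timing first token and total request.
import json
import time
import requests

ENDPOINT = "https://inference.example.com/v1/chat/completions"  # placeholder

def time_first_token(prompt: str, model: str, token: str) -> tuple[float, float]:
    start = time.monotonic()
    first = None
    with requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {token}"},
        json={"model": model, "stream": True,
              "messages": [{"role": "user", "content": prompt}]},
        stream=True, timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":
                break
            if first is None and json.loads(line[6:])["choices"]:
                first = time.monotonic() - start  # time to first token
    return first, time.monotonic() - start
```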
Model Optimization And Runtime Preparation
Model optimization is a separate architectural layer because it changes the artifact that enters serving. The optimization path should be selected before final benchmark acceptance, not after a service is already in production.
New model bring-up should use versioned recipes rather than one-off tuning notes. Each recipe should define the quantization path, backend path, serving topology, benchmark profile, accuracy gate, artifact provenance check, and rollback path before it is recommended for a partner service profile.
Model Data And Memory Architecture
Large-scale inference can bottleneck on model movement, cache locality, GPU memory pressure, and startup time. The model data plane sits between storage and serving workers and should be designed as deliberately as the serving runtime.
KV cache ownership, transfer, eviction, recovery, and observability must be explicit in the design. Host memory, local SSD, remote memory, and peer-to-peer transfer should be treated as planned tiers with accepted failure behavior rather than emergency overflow paths. Validation should include cache-aware routing behavior, worker restart behavior, cache-hit-rate tracking, and local-SSD wear review where SSD-backed cache or offload is used.
Model startup should be decomposed into artifact discovery, cache warmup, weight movement, container startup, backend initialization, and first-ready signaling. Record model download time, cache-hit time, peer transfer time, container ready time, backend ready time, first token time, and restart recovery time for every accepted service profile.
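A minimal sketch of the startup decomposition as a timing record; phase names mirror the decomposition above, and each mark() call should be wired into the real startup hooks of the deployed stack.

```python
# Sketch: decompose model startup into named phases and record durations.
# Phase names mirror the decomposition above; wire each mark() call into
# the actual startup hooks of the deployed stack.
import time

class StartupTimeline:
    def __init__(self) -> None:
        self.marks: dict[str, float] = {"start": time.monotonic()}

    def mark(self, phase: str) -> None:
        self.marks[phase] = time.monotonic()

    def durations(self) -> dict[str, float]:
        ordered = list(self.marks.items())
        return {
            phase: round(t - ordered[i - 1][1], 3)
            for i, (phase, t) in enumerate(ordered) if i > 0
        }

# Usage: timeline.mark("artifact_discovery"); ... timeline.mark("first_ready")
# Persist durations() with the service profile so restarts can be compared.
```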
Data Flow Diagrams
The inference stack supports multiple data paths, not one monolithic service path. These diagrams show the major GenAI, traditional ML, and deployment flows that operators should validate.
GenAI/LLM Inference Flow
Traditional ML Inference Flow
Model Deployment Flow
Key Component Interactions
Disaggregated LLM Serving
The critical interactions are Planner with Grove for right-sized prefill and decode capacity, Router with the KV Block Manager for cache-aware placement, and KV Block Manager with NIXL for memory-tier movement.
Kubernetes Infrastructure Stack
Component Interaction Matrix
Getting Started
Choose an adoption path based on the first problem the operator needs to solve.
Full Stack Deployment
- Define the cluster baseline with NVIDIA AI Cluster Runtime.
- Enable GPUs and networking with GPU Operator and Network Operator.
- Add KAI Scheduler and Grove for GPU allocation, gang scheduling, and topology-aware placement.
- Optimize the model with Model Optimizer, AITune, TensorRT, or TensorRT-LLM.
- Plan the serving topology with AIConfigurator.
- Deploy Dynamo, ModelExpress, NIXL, and the selected runtime backend.
- Validate endpoint behavior with AIPerf and the ISV NCP Validation Suite.
- Operate with telemetry, NVSentinel remediation, and release-state review.
Traditional ML Inference Only
- Optimize model artifacts with TensorRT, Model Optimizer, AITune, or DALI where relevant.
- Deploy the service on GPU-enabled Kubernetes.
- Validate latency, throughput, and preprocessing behavior with AIPerf.
GenAI/LLM Inference Only
- Select the serving backend and runtime path.
- Use AIConfigurator to narrow prefill, decode, backend, and GPU-count choices.
- Deploy Dynamo with TensorRT-LLM or the selected backend.
- Add the KV Block Manager, NIXL, ModelExpress, Grove, and KAI Scheduler when the service needs multi-node scale or fast model movement.
Kubernetes Integration Only
- Install GPU Operator and Network Operator.
- Add KAI Scheduler for GPU-aware allocation.
- Add Grove when workloads require coordinated multi-pod placement.
- Validate the cluster baseline before handing capacity to endpoint teams.
Example Workload: Large MoE LLM Inference
A large mixture-of-experts LLM service is the clearest example of why this architecture is disaggregated. The workload can require separate compute profiles for prefill and decode, cache-aware routing, topology-aware placement, fast model-weight movement, and benchmark-driven configuration.
Core Design Philosophy
The service should separate control plane, request routing, prefill, decode, KV cache management, model artifact movement, and operations. Each subsystem scales against a different bottleneck. Treating them as one monolithic deployment hides the bottleneck and makes capacity planning harder.
Key Architectural Components
- Dynamo coordinates distributed serving and backend workers.
- Planner and AIConfigurator narrow the prefill/decode and parallelism choices before broad benchmark runs.
- Router and KV Block Manager reduce redundant prefill work and manage cache locality.
- NIXL and ModelExpress accelerate model data and cache movement.
- Grove and KAI Scheduler keep multi-pod serving units schedulable and topology aware.
- AIPerf and the validation suite turn the design into acceptance evidence.
Reference Architecture
Deployment Recommendations
- Start with disaggregated serving for high-throughput or long-context services.
- Keep prefill and decode placement topology aware.
- Measure model load, first-ready time, time to first token, inter-token latency, and throughput together.
- Use cache-aware routing only with enough telemetry to prove cache locality helps the target workload.
- Treat performance claims as workload-specific until the partner validation report records the hardware, model, backend, traffic profile, and versions.
Logical Architecture
- Clients call a platform API, an invocation endpoint, or a model-serving endpoint.
- The API and serving layer applies routing, authentication hooks, admission policy, and endpoint-level workload controls.
- The serving layer selects an inference backend and coordinates request routing, scaling, prefill, decode, and runtime workers.
- The model data layer stages model artifacts, streams weights, exchanges large payloads, and manages memory-tier movement where the selected components support those paths.
- Kubernetes orchestration places the workload on GPU nodes, configures GPU and network resources, schedules related pods, and exposes health information.
- Benchmarking and validation tools measure latency, throughput, startup, configuration, and environment readiness.
- Operations tooling feeds health events, remediation workflows, and release-state changes back into the next architecture refresh.
Use this sequence to validate ownership boundaries. Record which component owns each control-plane decision, which component owns each data-plane movement, which telemetry signal proves the transition occurred, and which rollback action returns the service to the last accepted state.
Review the logical flow during every release and require unresolved ownership gaps to block promotion from lab to tenant-facing service.
Physical Architecture
The physical deployment starts with one or more GPU Kubernetes clusters. A production partner environment should map the logical layers to concrete availability zones, clusters, racks, GPU nodes, NICs, storage services, container registries, and operator namespaces. This architecture does not assume one fixed rack shape because the component set can run on cloud, self-managed, and lab environments.
Select the hardware profile before turning topology into requirements. RTX PRO, HGX, and GB300 NVL72 designs have different assumptions for GPU form factor, rack density, local scale-up, power, cooling, network fabric, DPU placement, and tenancy. The software RA should point to the selected Enterprise RA and then record which assumptions apply to the partner profile.
Minimum physical concerns for an implementation review:
- GPU node type, GPU count, memory capacity, CPU memory, local storage, and PCIe or NVLink topology.
- East-west network type, RDMA readiness, GPUDirect RDMA readiness, and switch-domain boundaries.
- Storage source for model artifacts, local cache capacity, shared cache options, and air-gapped behavior.
- Kubernetes version, container runtime, GPU Operator state, Network Operator state, and scheduler configuration.
- Placement rules for multi-node model instances, prefill pools, decode pools, control-plane services, and telemetry.
Record the physical bill of materials, firmware baseline, rack and rail topology, storage topology, and management-plane path with the architecture review. Validate that node labels, scheduler topology keys, storage classes, and network attachment definitions represent the physical design rather than hiding it from Kubernetes.
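A minimal sketch of the label check, assuming the kubernetes Python client; the label keys shown are examples (one published by GPU Feature Discovery, one a standard topology label) and should be replaced with the keys the profile actually uses.

```python
# Sketch: confirm GPU nodes carry the labels the scheduler and placement
# rules depend on. The keys are examples; replace them with the keys the
# selected profile actually uses.
from kubernetes import client, config

REQUIRED_LABELS = [
    "nvidia.com/gpu.product",        # published by GPU Feature Discovery
    "topology.kubernetes.io/zone",   # standard well-known topology label
]

def check_topology_labels() -> list[str]:
    config.load_kube_config()
    missing = []
    for node in client.CoreV1Api().list_node().items:
        if "nvidia.com/gpu" not in (node.status.allocatable or {}):
            continue  # only GPU workers need the placement labels
        labels = node.metadata.labels or {}
        for key in REQUIRED_LABELS:
            if key not in labels:
                missing.append(f"{node.metadata.name}: missing {key}")
    return missing
```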
Deployment Model
Day 0 defines the baseline. Use NVIDIA AI Cluster Runtime inputs, cluster snapshots, and operator values to describe the target cloud, accelerator, operating system, Kubernetes version, network shape, and workload intent. Record selected versions, topology assumptions, acceptance criteria, and tenant boundaries before deploying.
Day 1 installs and configures the stack. Enable GPU and network operators, choose scheduling and placement controls, deploy serving and model-data components, generate serving configuration, stage models, and run smoke tests before exposing partner endpoints.
Day 2 operates the environment. Track health, remediation, model cache state, benchmark drift, version changes, and validation results. Review the architecture when component versions change, when deployment evidence changes, or when partner validation expands support boundaries.
Deployment acceptance should require rendered manifests, applied values, image sources, storage classes, network attachments, service endpoints, smoke-test output, and rollback instructions. Record the owner and evidence path for each deployment phase so a failed install can be triaged without reconstructing the environment from terminal history.
Validation Methodology
Platform validation requires lab execution. Start with the selected component combination, then validate the cluster baseline, operator readiness, serving deployment, model-data paths, benchmark behavior, and operations workflows with the local validation suite and benchmarking tools.
The RA should be enforced as an executable acceptance bar. A cluster or service profile should not be described as accepted until validation covers firmware consistency, GPU and network readiness, east-west reachability, north-south redundancy, storage configuration, scheduler behavior, model-serving smoke tests, and the evidence package for any partner-specific support claim.
Validation work should produce:
- Cluster recipe or baseline evidence.
- Operator installation and readiness evidence.
- Model-serving deployment evidence.
- Benchmark outputs for time to first token, inter-token latency, request latency, throughput, startup, and failure behavior where relevant.
- Security and operations evidence for identity, secrets, observability, remediation, and upgrade paths.
For each validation run, record the model, tokenizer, backend, container image, hardware, driver, Kubernetes version, network mode, storage path, prompt profile, output profile, concurrency level, and pass or fail threshold. Require failures to identify the owning layer before closing the run.
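A minimal sketch of a validation-run record that cannot close without the required metadata; field names follow the list above, and the closing rule enforces the owning-layer requirement for failures.

```python
# Sketch: a validation-run record that refuses to close without the metadata
# this section requires. Field names follow the list above; extend as needed.
from dataclasses import dataclass, fields

@dataclass
class ValidationRun:
    model: str
    tokenizer: str
    backend: str
    container_image: str
    hardware: str
    driver: str
    kubernetes_version: str
    network_mode: str
    storage_path: str
    prompt_profile: str
    output_profile: str
    concurrency: int
    passed: bool
    owning_layer_on_failure: str = ""

    def can_close(self) -> bool:
        # A failed run must name the owning layer before it can be closed.
        if not self.passed and not self.owning_layer_on_failure:
            return False
        return all(getattr(self, f.name) != "" for f in fields(self)
                   if f.name != "owning_layer_on_failure")
```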
Performance Guidance
Start with configuration search before exhaustive load testing. Use the configuration tool to narrow prefill, decode, parallelism, backend, and GPU-count choices. Then use endpoint benchmarks to measure token latency, request latency, throughput, and concurrency behavior under realistic traffic. For model-start performance, collect model download, cache-hit, weight-transfer, and first-ready timestamps. For operations, track GPU health events, recovery time, failed placements, and capacity headroom.
Do not compare configurations unless the model, tokenizer, prompt distribution, output length distribution, GPU type, driver, network, serving backend, and concurrency profile are recorded together.
Speculative decoding, custom kernels, quantization, and model-specific tuning should be treated as controlled optimization inputs. Measure acceptance rate, accuracy delta, latency impact, backend compatibility, and rollback behavior before turning an optimization on by default.
Maintain a per-backend performance baseline and compare new results only against baselines that share the same model, hardware, traffic profile, and software versions.
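A minimal sketch of the baseline-matching rule; the identity fields mirror the guidance above, and the tolerance value is a placeholder for the accepted regression budget.

```python
# Sketch: compare a new benchmark result against a stored baseline only when
# the identity fields match; the field set mirrors the guidance above.
IDENTITY_FIELDS = ("model", "hardware", "traffic_profile", "backend_version")

def comparable(baseline: dict, candidate: dict) -> bool:
    return all(baseline.get(k) == candidate.get(k) for k in IDENTITY_FIELDS)

def regression(baseline: dict, candidate: dict,
               metric: str = "tokens_per_second", tolerance: float = 0.05) -> bool:
    """True when the candidate is more than `tolerance` worse than baseline."""
    if not comparable(baseline, candidate):
        raise ValueError("baselines differ; comparison is not meaningful")
    return candidate[metric] < baseline[metric] * (1 - tolerance)
```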
Sizing Guidance
Use software profiles and hardware profiles together. The software profile describes the component stack. The hardware profile describes the GPU node, rack, network, DPU, storage, cooling, and tenancy assumptions that make the software profile valid.
- Single-node inference: start with TensorRT, TensorRT-LLM, AITune, Model Optimizer, DALI, GPU Operator, and AIPerf.
- Multi-node LLM inference: add Dynamo, ModelExpress, NIXL, Grove, AIConfigurator, and scheduler controls.
- Partner platform inference: add NVIDIA Cloud Functions, NVIDIA AI Cluster Runtime, Network Operator, NVSentinel, and the validation suite.
- RTX PRO profile: use for agentic AI, visual computing, physical AI, simulation, data processing, and small or medium LLM services where PCIe GPU server modularity and enterprise scalability are the starting point.
- HGX profile: use for large GPU nodes and multi-node LLM profiles that need HGX topology and high-performance cluster networking.
- GB300 NVL72 profile: use for rack-scale services that need NVL72 local scale-up, liquid cooling, dual-plane networking, and tightly controlled model-parallel placement.
Scale only after measuring the bottleneck. Add GPU capacity for compute saturation, add model-data acceleration for cold-start and artifact movement, add scheduling controls for placement failures, and add network controls for high east-west transfer pressure.
For each profile, document the starting GPU count, target concurrency, prompt and output shape, model size, cache policy, network requirement, storage requirement, and expected autoscaling trigger. Validate the profile with AIPerf or an equivalent endpoint benchmark before copying it to another model family.
Review sizing profiles after each benchmark run and update the accepted profile only when the measured bottleneck, mitigation, and rollback threshold are recorded.
Telemetry And Observability
The inference platform should use the same three observability pillars as the broader NCP software guide: logs, metrics, and traces. In inference, those signals must be correlated with model, endpoint, tenant, GPU, node, scheduler, and network context.
The hot path supports real-time operations: dashboards, alerting, incident response, and service debugging. The cold path supports planning: capacity analysis, regression tracking, cost attribution, validation history, and long-term trend analysis. Keep both paths connected through stable identifiers for tenant, endpoint, model, node, GPU, and request.
Cache-hit-rate, prefill saturation, decode saturation, model-load state, queue depth, backend version, and routing decision context should be first-class operating signals. These metrics are required to distinguish a serving bottleneck from a cache, model-data, scheduler, or network bottleneck.
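A minimal sketch of context-labeled serving signals, assuming the prometheus_client library; metric and label names are illustrative, and per-request identifiers should stay in traces rather than metric labels to keep cardinality in check.

```python
# Sketch: publish serving signals with the correlating labels this section
# names, using the prometheus_client library. Metric and label names are
# illustrative; keep per-request identifiers in traces, not metric labels.
from prometheus_client import Gauge, Histogram, start_http_server

TTFT = Histogram(
    "inference_time_to_first_token_seconds",
    "Time to first token per request",
    labelnames=["tenant", "endpoint", "model", "backend_version"],
)
CACHE_HIT_RATE = Gauge(
    "inference_kv_cache_hit_ratio",
    "KV cache hit ratio per endpoint",
    labelnames=["tenant", "endpoint", "model"],
)

start_http_server(9100)  # expose /metrics for the hot-path scrape
TTFT.labels("tenant-a", "chat", "llama-70b", "r1").observe(0.42)
CACHE_HIT_RATE.labels("tenant-a", "chat", "llama-70b").set(0.87)
```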
Break-Fix Architecture
The break-fix system should detect, triage, remediate, validate, and return GPU infrastructure to service with minimal tenant impact. The design should separate actions that happen while a resource is in the tenant domain from actions that happen after the operator pulls the resource back into the infrastructure domain.
Break-fix policy should be specific about what can happen in place and what requires tenant hand-off. A pod restart, worker replacement, or endpoint scale action may stay inside the tenant domain. GPU reset, firmware remediation, repeated XID errors, network fabric issues, or node reprovisioning may require operator-domain handling.
Node Level Health Checks
Node health checks should run at the level where the signal is visible. GPU driver and device checks run on the host, VM, or container that has the device. Kubernetes health checks run through the cluster. Fabric health checks run through the network control plane. Endpoint checks run through serving APIs. The break-fix control plane should correlate these signals instead of treating them as separate incidents.
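A minimal correlation sketch that groups signals by node within a time window instead of opening one incident per signal; the signal shape is illustrative, and real DCGM, Kubernetes, fabric, and endpoint events should feed the same structure.

```python
# Sketch: correlate health signals from different sources into one incident
# per node within a time window. Signal shape is illustrative; feed real
# DCGM, Kubernetes, fabric, and endpoint events into the same structure.
from collections import defaultdict

def correlate(signals: list[dict], window_s: float = 300.0) -> list[dict]:
    """Group signals by node, splitting groups separated by more than window_s."""
    by_node = defaultdict(list)
    for s in sorted(signals, key=lambda s: s["ts"]):
        by_node[s["node"]].append(s)
    incidents = []
    for node, events in by_node.items():
        group = [events[0]]
        for e in events[1:]:
            if e["ts"] - group[-1]["ts"] > window_s:
                incidents.append({"node": node, "signals": group})
                group = []
            group.append(e)
        incidents.append({"node": node, "signals": group})
    return incidents

# Example signal: {"ts": 1700000000.0, "node": "gpu-07",
#                  "source": "dcgm", "detail": "XID 79"}
```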
Performance Requirements
Inference performance depends on native access to GPUs, low-latency network paths, model artifact availability, scheduling locality, and runtime configuration. The operator should validate performance for each accepted service profile rather than assuming one benchmark generalizes to every model.
Virtual Machine And Container Networking
High-performance inference workers should avoid unnecessary network abstraction on latency-sensitive or high-bandwidth paths. Where the design requires direct NIC access, use SR-IOV, RDMA, GPUDirect RDMA, or equivalent platform mechanisms. Standard CNI networking may still be appropriate for control traffic, user APIs, and lower-volume service calls.
GPU Exposure
Model And Storage Performance
Model load and cache behavior can dominate service readiness. Measure download time, cache-hit time, disk-to-GPU time, peer-to-peer transfer time, first-ready time, and first-token behavior. Use ModelExpress, NIXL, model streaming, and local cache only where they map to an observed bottleneck.
Serving Performance
Serving tests should record time to first token, inter-token latency, request latency, output throughput, concurrency, error rate, model size, prompt distribution, output distribution, backend, GPU type, and software versions. Use AIConfigurator to reduce the configuration search space and AIPerf to measure endpoint behavior.
Performance acceptance should require a reproducible command or workload definition, captured environment metadata, stored benchmark output, and a pass or fail threshold. Reject comparisons that omit backend version, model artifact revision, network mode, or cache state.
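A minimal sketch of the acceptance gate; the required metadata keys follow this section, and the thresholds are placeholders for the accepted service profile.

```python
# Sketch: a pass/fail gate over a stored benchmark output. Thresholds and
# metric names are placeholders for the accepted service profile; the gate
# also rejects results that omit required environment metadata.
REQUIRED_METADATA = ("backend_version", "model_revision", "network_mode", "cache_state")
THRESHOLDS = {"ttft_p95_s": 0.5, "itl_p95_s": 0.05, "error_rate": 0.01}

def accept(result: dict) -> tuple[bool, list[str]]:
    reasons = [f"missing metadata: {k}" for k in REQUIRED_METADATA
               if k not in result.get("metadata", {})]
    for metric, limit in THRESHOLDS.items():
        value = result["metrics"].get(metric)
        if value is None or value > limit:
            reasons.append(f"{metric}={value} exceeds limit {limit}")
    return (not reasons), reasons
```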
Isolation And Security
Isolation and security must be designed across the infrastructure, Kubernetes, AI platform, model artifact, telemetry, and user-access layers. The goal is to protect tenants from each other, protect platform services from tenant workloads, and keep operator actions auditable.
Security And Compliance
Security review should cover identity boundaries between platform APIs, model-serving services, operators, and cluster agents. Secrets must be scoped by namespace and service account. Model artifact sources should have access controls, provenance checks, and air-gap behavior when needed. Network policy should separate API ingress, control plane, metadata stores, model-data movement, and telemetry. Release reviews should record component revisions, package metadata, container image origins, SBOM availability, and validation outputs.
Workload Isolation
An NCP inference platform should define tenant boundaries at the infrastructure, Kubernetes, platform API, model artifact, telemetry, and network layers. Bare metal gives the strongest node-level isolation. Virtual machines provide a strong cloud abstraction and can improve tenant lifecycle handling. Kubernetes namespaces are useful but should not be the only isolation boundary for high-value multi-tenant GPU services.
For managed Kubernetes and AI platform services, per-tenant control planes or strongly isolated control-plane partitions reduce cross-tenant blast radius. Shared services such as registries, identity services, metadata services, and observability backends must enforce tenant-aware access.
Tenancy should be visible to the platform API, scheduler, router, model-data services, tracing, and telemetry. Validate tenant-aware routing, service-account scope, secret access, model artifact access, trace partitioning, quota behavior, and noisy-neighbor behavior under load.
Boot And Attestation
Operators should maintain trust in the firmware and software running on GPU nodes, DPUs, BMCs, and control-plane systems. Secure boot establishes a chain of signed software. Measured boot records what was loaded. Remote attestation lets a verifier compare measurements with accepted values before a node is handed to tenant workloads.
Shared Responsibility Model
The operator owns infrastructure, physical security, node lifecycle, GPU and network enablement, platform control planes, and tenant isolation. The tenant owns its users, model artifacts, endpoint policy, application configuration, and workload-level security. End users own application credentials, prompt and data handling, and any application logic they deploy on top of the service.
Operations And Lifecycle Management
Run the stack as a versioned platform. Track operator versions, CRDs, inference runtimes, model-data services, scheduler configuration, and validation tooling in the same release review. Use health monitoring and remediation workflows for GPU and NVSwitch faults. Keep benchmark baselines for each accepted configuration, and rerun them after submodule updates, driver changes, Kubernetes upgrades, or network changes.
Release reviews should include top model configurations for each accepted backend, known backend gaps, runtime-specific regressions, and the rollback decision for each service profile. The operations process should treat unsupported feature gaps as explicit disposition items rather than rediscovering them during incidents.
Compatibility Matrix
Design Alternatives And Tradeoffs
- Full stack versus subset: the full stack gives platform coverage, but a narrower subset is appropriate for single-node or traditional ML services.
- Aggregated serving versus disaggregated serving: aggregated serving is simpler, while disaggregated serving gives separate control over prefill, decode, and model-data movement.
- Static placement versus topology-aware scheduling: static placement is easier to reason about, while topology-aware scheduling is better for multi-node and tightly coupled inference units.
- Shared storage versus peer-to-peer transfer: shared storage is familiar, while peer-to-peer transfer can reduce duplicate downloads and cold-start pressure when the environment supports it.
- Online benchmarking versus offline configuration search: offline search reduces the test space, but final acceptance still requires measured workload behavior.
- RTX PRO versus HGX versus GB300 NVL72: each hardware profile has different GPU, rack, network, DPU, power, cooling, and tenancy assumptions, so choose the profile before writing mandatory topology requirements.
- Direct platform API versus Kubernetes Gateway API Inference Extension: direct platform APIs can keep endpoint behavior fully inside the NCP control plane, while the Gateway API option provides Kubernetes-native InferencePool and InferenceModel integration for teams that need that API surface.
- Required fabric versus validated fabric decision: large disaggregated services may require high-performance east-west paths, while single-node or pure endpoint profiles may not need the same compute fabric.
- DPU-enabled design versus host-only design: DPU offload can improve isolation and infrastructure control for selected profiles, but it should be tied to the chosen Enterprise RA and validation evidence.
- Base RA versus partner implementation: the base RA should name NVIDIA components and cloud-native API standards, while partner-selected ecosystem tools should be recorded as implementation decisions.
Known Limitations
- This reference architecture is not a substitute for partner lab validation.
- The compatibility matrix identifies a starting software set, not a universal support statement.
- Physical topology, performance numbers, and support boundaries must be supplied by the partner validation process.
- Benchmark results must be tied to the model, hardware, backend, traffic profile, and software versions used in the test.
- Hardware guidance must be reconciled with the selected NVIDIA Enterprise RA for RTX PRO, HGX, GB300 NVL72, or another approved profile.
- The Kubernetes Gateway API Inference Extension provides an interoperability path; it does not replace Dynamo serving design decisions, backend disposition, or KV-aware routing validation.
- Non-NVIDIA implementation software should not become a base RA dependency unless the partner records it as a local decision outside the reference stack.
Next Steps
- Select the component combination that matches the target inference service.
- Select the hardware profile and confirm which RTX PRO, HGX, GB300 NVL72, cloud-hosted, or lab assumptions apply.
- Confirm the cluster baseline, GPU capacity, network fabric, storage, and model artifact sources.
- Deploy the selected stack in a lab environment.
- Run validation for serving behavior, performance, security controls, and Day 2 operations.
- Publish partner-specific constraints, benchmark results, and support boundaries with the final architecture.