This reference architecture describes an inference-focused software stack for NVIDIA Cloud Partners (NCPs). It helps operators build a cloud-native platform that can host large language model services, multimodal services, traditional machine learning inference, asynchronous GPU tasks, and partner-specific AI platforms on shared NVIDIA accelerated infrastructure.
The design assumes the operator needs infrastructure that behaves more like a cloud service than a static cluster. Tenants should be able to request endpoints, models, Kubernetes capacity, GPU workers, model storage, and operations support without understanding every physical detail in the data center. Operators still need clear control over placement, isolation, performance, health, and lifecycle management.
The architecture uses a layered approach. It starts with data center assumptions, then shows tenant and operator views, then maps those views to infrastructure, Kubernetes, AI platform services, inference serving, model data movement, validation, telemetry, break-fix, performance, and security.
NVIDIA inference deployments now need to handle large language models, multimodal models, traditional ML services, asynchronous GPU tasks, and platform APIs across many GPU nodes. This architecture defines a repeatable stack for those needs: cluster recipes and operators at the base, topology-aware scheduling and health management in the platform, model and cache movement in the data plane, model-serving orchestration above inference engines, and benchmarking plus validation around the full system.
The architecture defines component roles and integration points for a complete inference platform. Claims that require lab results are framed as validation work to run, not as completed proof.
Inference platforms are no longer a single model server behind a load balancer. Large models can require multi-node placement, separate prefill and decode pools, fast model-weight movement, cache-aware routing, GPU-aware scheduling, and operational remediation when GPU or network health changes. Partner teams need a way to assemble these capabilities without losing traceability to the component versions and design boundaries that define the stack.
The base reference architecture (RA) should mandate evidence, ownership, and decision records, not a single provider implementation. Treat partner-selected ecosystem software, physical cluster shape, DPU placement, fabric topology, storage product, gateway implementation, and provisioning tool as architecture decisions unless an NVIDIA component contract requires them.
This architecture assumes an NCP data center built from GPU compute PODs and a core POD. GPU PODs host accelerated tenant work. The core POD hosts shared control planes, storage services, observability backends, registries, validation services, and operator automation. The selected hardware profile should expose the practical domains needed by the inference platform: tenant access, secure management, cluster interconnect where required, and local GPU scale-up where supported.
GPU compute nodes run the endpoint workers, prefill workers, decode workers, model data-plane services, benchmark jobs, and supporting sidecars. The node design should account for GPU memory capacity, CPU memory, local NVMe capacity, NIC count, NIC speed, PCIe topology, and GPU-to-GPU topology. For large models, node placement must be considered part of the architecture because a poor placement decision can turn a valid software stack into a slow service.
The baseline node software includes a supported OS image, container runtime, NVIDIA driver, NVIDIA Container Toolkit, GPU device plugin, DCGM telemetry, and the networking stack required for RDMA or GPUDirect RDMA when those paths are part of the design.
The inference platform uses multiple networks with different responsibilities.
For disaggregated serving, network design must be reviewed together with serving topology. Prefill pools, decode pools, routers, cache services, and model-weight transfer services should be placed so the highest-volume traffic remains on the most appropriate fabric. RDMA, GPUDirect RDMA, separate east-west compute fabric, DPU offload, and dual-plane topology are decisions to validate against the selected RTX PRO, HGX, GB300 NVL72, cloud-hosted, or lab profile.
Before endpoint validation begins, the operator should validate the required network paths independently. The minimum network evidence should cover RDMA readiness where required, GPUDirect RDMA readiness where required, cross-rail behavior where present, congestion behavior, and degradation behavior when a network path fails or becomes saturated.
Inference needs more than one storage tier. Model artifacts often start in object or file storage. Hot models may need node-local cache, shared cache, or GPU-to-GPU transfer paths. Logs, traces, metrics, benchmark outputs, and validation reports need retention policies that are separate from model storage. Local NVMe is useful for image cache, model cache, temporary tensors, and short-lived logs, but tenant data must be sanitized when infrastructure changes ownership.
The architecture should distinguish persistent model artifact storage from ephemeral cache and from telemetry retention. ModelExpress, NIXL, Velo, FlexTensor, and model streaming components sit in this boundary between storage, memory, and runtime workers.
Storage validation should happen before model-serving validation. The operator should record model-load throughput, cache-hit behavior, shared-file-system behavior, local-NVMe behavior, and failure behavior for storage or fabric degradation. Where local SSD is used for cache or KV offload, the design should include wear review and data-sanitization procedures.
At the data center level, the inference platform should be viewed as a set of control planes and data planes. Control planes run in core services or per-tenant platform clusters. Data planes run close to the GPU nodes. The operator should document which services run in the core POD, which run in tenant Kubernetes clusters, which run as DaemonSets, which run as endpoint-specific workers, and which require direct access to GPUs, NICs, or local storage.
From a tenant point of view, the platform exposes AI services, Kubernetes resources, model endpoints, and operational status. The tenant should not need to understand the entire data center, but the platform must preserve enough topology and resource intent to run high-performance inference.
From an operator point of view, the stack is a set of lifecycle systems. The operator builds the cluster baseline, enables GPU and network resources, exposes tenant or platform control planes, schedules workloads, moves model data, monitors health, responds to failures, and validates changes before handing capacity to tenants.
The architecture uses a layered model. Platform APIs and workload endpoints sit above the serving layer. Serving frameworks coordinate inference engines and request flow. Optimization tools prepare models and execution paths. Model data services move weights, tensors, KV cache blocks, and large payloads. Kubernetes infrastructure components provide GPU enablement, networking, scheduling, placement, health, and repeatable recipes. Benchmarking and validation tools close the loop.
The following views show the same inference platform from different operating angles. Use them during design review to confirm that the architecture has a complete layered stack, a clean control-plane and data-plane boundary, explicit network and storage gates, multi-cluster routing semantics, and a validation loop that can reject weak service profiles before production handoff. Each view should be reviewed with the component matrix, source evidence, and release state so the diagram remains an implementation aid rather than a static illustration.
Use this view to confirm that every tenant-facing endpoint has a path through platform API, serving orchestration, model data movement, Kubernetes orchestration, accelerated infrastructure, and validation. Missing ownership in any layer should block promotion of the service profile.
Use this view to separate policy, planning, scheduling, and health control from the hot path that routes requests, moves weights, manages KV cache, and executes inference. The split should be explicit in namespace design, service-account policy, placement policy, telemetry, and failure handling.
Use this view to validate infrastructure readiness before the serving stack is blamed for endpoint behavior. Tenant access, secure management, east-west fabric, local GPU scale-up, model artifact storage, and node-local NVMe each need independent evidence before a disaggregated inference profile is accepted.
Use this view when an endpoint can run in more than one cluster, region, or availability zone. The architecture should define the routing inputs, cache-locality signals, health checks, compliance tags, and fallback behavior before endpoint traffic is distributed across clusters.
Use this view as the release path for a service profile. A profile should not be published until the cluster baseline, operators, network, storage, serving path, performance targets, security posture, source evidence, claim audit, and depth audit have passed.
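The release path above amounts to an all-gates-must-pass check. The following is a minimal sketch of that idea, not part of any NVIDIA tool: the gate names mirror the list in the preceding paragraph, and `REQUIRED_GATES` and `can_publish` are hypothetical names introduced for illustration.

```python
# Gate names follow the release path described in the text above.
REQUIRED_GATES = (
    "cluster_baseline", "operators", "network", "storage", "serving_path",
    "performance_targets", "security_posture", "source_evidence",
    "claim_audit", "depth_audit",
)


def can_publish(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (publishable, blocking_gates) for one service profile.

    A gate that is missing from `results` counts as not passed, so an
    unrecorded check blocks publication just like a failed one.
    """
    blocking = [gate for gate in REQUIRED_GATES if not results.get(gate, False)]
    return (not blocking, blocking)
```

Treating a missing result as a failure keeps the gate conservative: a profile cannot be published simply because a check was never run.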
The IaaS layer provides the consumable infrastructure underneath the inference platform. An NCP may expose bare metal, virtual machines, managed Kubernetes clusters, or an integrated AI platform. Regardless of the product shape, the IaaS layer must define how GPU nodes are provisioned, sanitized, patched, monitored, placed, and returned to service.
The cloud and cluster control plane should capture tenant intent and turn it into concrete infrastructure operations. For inference, the most important intents are GPU capacity, model endpoint capacity, network capability, storage access, isolation model, service objective, and validation state.
NVIDIA AI Cluster Runtime provides a recipe-driven way to describe known-good combinations of cloud, accelerator, OS, Kubernetes, operators, and workload intent. In this architecture, AI Cluster Runtime is the baseline contract between the infrastructure layer and the inference layer.
The compute service manages GPU nodes, general-purpose control nodes, and any storage or utility nodes that support the platform. It should support inventory, provisioning, firmware policy, OS image policy, node readiness, tenant hand-off, and sanitization.
For inference, compute placement must account for GPU topology, NIC topology, model size, prefill and decode roles, cache locality, endpoint isolation, and failure domains. Full-node allocation is simplest for large LLM services. Smaller services may use MIG or virtualized GPU paths, but those choices must be validated against latency, throughput, and isolation requirements.
The network layer must provide tenant isolation and high-performance east-west communication. The tenant access network carries APIs and user traffic. The cluster interconnect carries distributed inference and model data movement. The secure management network carries provisioning and operator control. NVLink provides local GPU scale-up within supported domains.
NVIDIA Network Operator belongs in the Kubernetes layer when RDMA and GPUDirect RDMA components need to be managed as part of the platform. The underlying NCP network control plane still owns fabric-level routing, tenant segmentation, address management, and switch configuration.
The storage layer provides persistent model artifacts, endpoint configuration, benchmark artifacts, logs, traces, metrics, and ephemeral cache. It should expose file, object, block, and local storage options based on workload needs.
The model data-plane components in this architecture do not replace storage. They improve the path from storage to serving workers by coordinating cache state, streaming tensors, moving weights, and staging large payloads across GPU and host memory tiers.
Kubernetes is the primary orchestration layer for cloud-native inference workloads. It provides declarative APIs, controllers, scheduling, service discovery, horizontal scaling, namespace isolation, and a consistent packaging model for model-serving services and platform services.
AI practitioners need reproducible environments, model endpoints, benchmark feedback, and access to GPU capacity. Developers need stable APIs, deployment workflows, traffic routing, and observability. Platform engineers need GPU enablement, placement controls, tenant boundaries, upgrade workflows, and break-fix automation.
The inference architecture uses Kubernetes for three jobs:
NVIDIA GPU Operator enables GPU support on worker nodes. NVIDIA Network Operator enables RDMA and GPUDirect RDMA networking components where required. KAI Scheduler provides GPU-aware queueing and allocation policy. Grove provides gang scheduling, startup ordering, and topology-aware placement for multi-pod inference units.
These components matter because inference services are often not independent pods. A large service may need routers, prefill workers, decode workers, cache services, sidecars, and model-data services to start and scale coherently. If only part of the service schedules, GPUs can sit idle while the endpoint remains unhealthy.
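The all-or-nothing behavior described above can be illustrated with a small placement routine. This is a sketch under simplifying assumptions and does not describe how Grove or KAI Scheduler are implemented: it considers only GPU counts, uses a first-fit-decreasing heuristic that can miss placements an exhaustive search would find, and the function name, pod names, and node names are hypothetical.

```python
from typing import Optional


def gang_placement(pod_gpu_requests: dict[str, int],
                   free_gpus_by_node: dict[str, int]) -> Optional[dict[str, str]]:
    """Place every pod of one inference unit, or place none of them."""
    free = dict(free_gpus_by_node)  # work on a copy; the caller commits on success
    placement: dict[str, str] = {}
    # Largest requests first (first-fit decreasing heuristic).
    for pod, gpus in sorted(pod_gpu_requests.items(), key=lambda kv: -kv[1]):
        node = next((n for n, avail in free.items() if avail >= gpus), None)
        if node is None:
            return None  # one pod does not fit, so the whole unit stays pending
        free[node] -= gpus
        placement[pod] = node
    return placement
```

Returning `None` instead of a partial placement is the point of the example: no GPU is reserved for a service that cannot become healthy.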
The Kubernetes Gateway API Inference Extension can be included as a cloud-native interoperability layer when the platform needs Kubernetes-native InferencePool and InferenceModel resources. Treat that extension as an API integration decision, not as a required scheduler, serving backend, or replacement for the Dynamo serving control path.
Operational acceptance should validate CRD installation, operator reconcile health, scheduler events, pod-group placement, GPU allocation, network attachment, and rollback behavior. The platform team should record namespace boundaries, service accounts, node selectors, topology keys, queue policy, and upgrade order before accepting a tenant-facing cluster profile.
The AI platform layer turns GPU infrastructure into user-facing inference services. It owns endpoint lifecycle, function or task lifecycle, model-serving APIs, routing policy, rate limits, identity hooks, user observability, and integration with model artifacts.
Backend selection is an architectural decision, not only an implementation detail. The platform should maintain a backend disposition matrix that records which model classes use TensorRT-LLM, TensorRT, Dynamo-managed workers, or another compatible backend path; which gaps are accepted; which gaps block production; and which rollback path is available when a runtime release changes token behavior, tool-call behavior, LoRA behavior, or weight-loading behavior.
Multi-cluster routing should separate endpoint routing, cluster routing, model routing, and cache routing. A request should cross clusters only when policy, locality, compliance tags, health, capacity, trace context, and cache state are visible to the routing layer.
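One way to read the visibility rule above is that a cluster with any unknown signal is not a routing candidate. The sketch below models only health, capacity, compliance tags, and cache state; policy, locality weighting beyond "prefer local", and trace context are omitted. `ClusterSignals` and `pick_cluster` are hypothetical names, and the tie-breaking order is an assumption rather than a documented behavior of any router.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterSignals:
    name: str
    healthy: Optional[bool] = None          # None means the signal is not visible
    free_capacity: Optional[int] = None
    compliance_tags: Optional[frozenset] = None
    cache_hit_ratio: Optional[float] = None


def pick_cluster(local: str, required_tags: frozenset,
                 clusters: list[ClusterSignals]) -> Optional[str]:
    """Prefer the local cluster; consider others only on fully visible signals."""
    def eligible(c: ClusterSignals) -> bool:
        visible = None not in (c.healthy, c.free_capacity,
                               c.compliance_tags, c.cache_hit_ratio)
        return (visible and c.healthy and c.free_capacity > 0
                and required_tags <= c.compliance_tags)

    candidates = [c for c in clusters if eligible(c)]
    if not candidates:
        return None  # fallback behavior is a policy decision and is not modeled here
    # Local cluster first, then the best cache locality.
    best = max(candidates, key=lambda c: (c.name == local, c.cache_hit_ratio))
    return best.name
```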
Model optimization is a separate architectural layer because it changes the artifact that enters serving. The optimization path should be selected before final benchmark acceptance, not after a service is already in production.
New model bring-up should use versioned recipes rather than one-off tuning notes. Each recipe should define the quantization path, backend path, serving topology, benchmark profile, accuracy gate, artifact provenance check, and rollback path before it is recommended for a partner service profile.
Large-scale inference can bottleneck on model movement, cache locality, GPU memory pressure, and startup time. The model data plane sits between storage and serving workers and should be designed as deliberately as the serving runtime.
KV cache ownership, transfer, eviction, recovery, and observability must be explicit in the design. Host memory, local SSD, remote memory, and peer-to-peer transfer should be treated as planned tiers with accepted failure behavior rather than emergency overflow paths. Validation should include cache-aware routing behavior, worker restart behavior, cache-hit-rate tracking, and local-SSD wear review where SSD-backed cache or offload is used.
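Treating memory tiers as planned tiers means every block has a defined next stop when a tier fills. The toy model below illustrates that idea with oldest-first demotion between tiers. It is a sketch only: it does not represent the KV Block Manager or NIXL, it stores opaque byte payloads, and the class name, tier names, and capacities are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    """Toy model of planned cache tiers with oldest-first demotion."""

    def __init__(self, capacities: dict[str, int]):
        # Tier order follows dict insertion order, fastest tier first.
        self.capacities = dict(capacities)
        self.tiers = {name: OrderedDict() for name in capacities}
        self.evicted = 0  # blocks that fell off the last tier

    def put(self, block_id: str, payload: bytes) -> None:
        for tier in self.tiers.values():
            tier.pop(block_id, None)  # keep at most one copy of a block
        self._insert(list(self.tiers), block_id, payload)

    def _insert(self, tier_names: list[str], block_id: str, payload: bytes) -> None:
        if not tier_names:
            self.evicted += 1  # no tier left; the block would need recomputation
            return
        name = tier_names[0]
        tier = self.tiers[name]
        tier[block_id] = payload
        if len(tier) > self.capacities[name]:
            old_id, old_payload = tier.popitem(last=False)  # oldest entry
            self._insert(tier_names[1:], old_id, old_payload)

    def locate(self, block_id: str) -> Optional[str]:
        """Return the tier holding the block, or None on a miss."""
        for name, tier in self.tiers.items():
            if block_id in tier:
                return name
        return None
```

The `evicted` counter and `locate` result are the kind of signals a cache-hit-rate dashboard would be built on.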
Model startup should be decomposed into artifact discovery, cache warmup, weight movement, container startup, backend initialization, and first-ready signaling. Record model download time, cache-hit time, peer transfer time, container ready time, backend ready time, first token time, and restart recovery time for every accepted service profile.
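The startup decomposition above can be recorded as a sequence of timestamps and reduced to per-phase durations. The sketch assumes the phases are strictly sequential in the order the text lists them, which may not hold for runtimes that overlap weight movement with container startup. The event names and `phase_durations` are hypothetical.

```python
# Event order follows the decomposition described in the text above.
STARTUP_EVENTS = (
    "requested", "artifact_discovered", "cache_warm", "weights_loaded",
    "container_ready", "backend_ready", "first_ready",
)


def phase_durations(timestamps: dict[str, float]) -> dict[str, float]:
    """Return seconds spent reaching each event from the previous one."""
    missing = [e for e in STARTUP_EVENTS if e not in timestamps]
    if missing:
        raise ValueError(f"missing startup events: {missing}")
    durations: dict[str, float] = {}
    for prev, cur in zip(STARTUP_EVENTS, STARTUP_EVENTS[1:]):
        delta = timestamps[cur] - timestamps[prev]
        if delta < 0:
            raise ValueError(f"{cur} recorded before {prev}")
        durations[cur] = delta
    return durations
```

Rejecting incomplete or out-of-order records keeps a profile from being accepted with a startup phase that was never measured.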
The inference stack supports multiple data paths, not one monolithic service path. These diagrams show the major GenAI, traditional ML, and deployment flows that operators should validate.
The critical interactions are Planner with Grove for right-sized prefill and decode capacity, Router with the KV Block Manager for cache-aware placement, and KV Block Manager with NIXL for memory-tier movement.
Choose an adoption path based on the first problem the operator needs to solve.
A large mixture-of-experts LLM service is the clearest example of why this architecture is disaggregated. The workload can require separate compute profiles for prefill and decode, cache-aware routing, topology-aware placement, fast model-weight movement, and benchmark-driven configuration.
The service should separate control plane, request routing, prefill, decode, KV cache management, model artifact movement, and operations. Each subsystem scales against a different bottleneck. Treating them as one monolithic deployment hides the bottleneck and makes capacity planning harder.
Use this sequence to validate ownership boundaries. Record which component owns each control-plane decision, which component owns each data-plane movement, which telemetry signal proves the transition occurred, and which rollback action returns the service to the last accepted state.
Review the logical flow during every release and require unresolved ownership gaps to block promotion from lab to tenant-facing service.
The physical deployment starts with one or more GPU Kubernetes clusters. A production partner environment should map the logical layers to concrete availability zones, clusters, racks, GPU nodes, NICs, storage services, container registries, and operator namespaces. This architecture does not assume one fixed rack shape because the component set can run on cloud, self-managed, and lab environments.
Select the hardware profile before turning topology into requirements. RTX PRO, HGX, and GB300 NVL72 designs have different assumptions for GPU form factor, rack density, local scale-up, power, cooling, network fabric, DPU placement, and tenancy. The software RA should point to the selected Enterprise RA and then record which assumptions apply to the partner profile.
Minimum physical concerns for an implementation review:
Record the physical bill of materials, firmware baseline, rack and rail topology, storage topology, and management-plane path with the architecture review. Validate that node labels, scheduler topology keys, storage classes, and network attachment definitions represent the physical design rather than hiding it from Kubernetes.
Day 0 defines the baseline. Use NVIDIA AI Cluster Runtime inputs, cluster snapshots, and operator values to describe the target cloud, accelerator, operating system, Kubernetes version, network shape, and workload intent. Record selected versions, topology assumptions, acceptance criteria, and tenant boundaries before deploying.
Day 1 installs and configures the stack. Enable GPU and network operators, choose scheduling and placement controls, deploy serving and model-data components, generate serving configuration, stage models, and run smoke tests before exposing partner endpoints.
Day 2 operates the environment. Track health, remediation, model cache state, benchmark drift, version changes, and validation results. Review the architecture when component versions change, when deployment evidence changes, or when partner validation expands support boundaries.
Deployment acceptance should require rendered manifests, applied values, image sources, storage classes, network attachments, service endpoints, smoke-test output, and rollback instructions. Record the owner and evidence path for each deployment phase so a failed install can be triaged without reconstructing the environment from terminal history.
Platform validation requires lab execution. Start with the selected component combination, then validate the cluster baseline, operator readiness, serving deployment, model-data paths, benchmark behavior, and operations workflows with the local validation suite and benchmarking tools.
The RA should be enforced as an executable acceptance bar. A cluster or service profile should not be described as accepted until validation covers firmware consistency, GPU and network readiness, east-west reachability, north-south redundancy, storage configuration, scheduler behavior, model-serving smoke tests, and the evidence package for any partner-specific support claim.
Validation work should produce:
For each validation run, record the model, tokenizer, backend, container image, hardware, driver, Kubernetes version, network mode, storage path, prompt profile, output profile, concurrency level, and pass or fail threshold. Require failures to identify the owning layer before closing the run.
Start with configuration search before exhaustive load testing. Use a configuration tool such as AIConfigurator to narrow prefill, decode, parallelism, backend, and GPU-count choices. Then use an endpoint benchmark such as AIPerf to measure token latency, request latency, throughput, and concurrency behavior under realistic traffic. For model-start performance, collect model download, cache-hit, weight-transfer, and first-ready timestamps. For operations, track GPU health events, recovery time, failed placements, and capacity headroom.
Do not compare configurations unless the model, tokenizer, prompt distribution, output length distribution, GPU type, driver, network, serving backend, and concurrency profile are recorded together.
Speculative decoding, custom kernels, quantization, and model-specific tuning should be treated as controlled optimization inputs. Measure acceptance rate, accuracy delta, latency impact, backend compatibility, and rollback behavior before turning an optimization on by default.
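The measurements named above can be combined into a single default-on decision. The sketch below is illustrative: the thresholds are placeholder assumptions that each operator would set per service profile, and `OptimizationTrial` and `enable_by_default` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptimizationTrial:
    name: str
    acceptance_rate: Optional[float]  # draft-token acceptance; None if not speculative
    accuracy_delta: float             # optimized minus baseline score; negative is worse
    latency_ratio: float              # optimized / baseline latency; below 1.0 is faster
    backend_compatible: bool
    rollback_tested: bool


def enable_by_default(trial: OptimizationTrial,
                      min_acceptance: float = 0.6,     # placeholder threshold
                      max_accuracy_drop: float = 0.01  # placeholder threshold
                      ) -> bool:
    """Turn an optimization on by default only when every measured gate passes."""
    acceptance_ok = (trial.acceptance_rate is None
                     or trial.acceptance_rate >= min_acceptance)
    return (acceptance_ok
            and trial.accuracy_delta >= -max_accuracy_drop
            and trial.latency_ratio < 1.0
            and trial.backend_compatible
            and trial.rollback_tested)
```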
Maintain a per-backend performance baseline and compare new results only against baselines that share the same model, hardware, traffic profile, and software versions.
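The comparison rules above can be enforced mechanically before any two results are placed side by side. This is a minimal sketch: the field names are hypothetical, and `driver` stands in for the wider set of software versions that a real record would carry.

```python
# Fields that must be recorded together before any comparison is made.
RECORDED_FIELDS = (
    "model", "tokenizer", "prompt_distribution", "output_length_distribution",
    "gpu_type", "driver", "network", "serving_backend", "concurrency_profile",
)
# Fields that must be identical between a new result and its baseline.
BASELINE_KEYS = (
    "model", "gpu_type", "driver", "serving_backend",
    "prompt_distribution", "output_length_distribution", "concurrency_profile",
)


def unrecorded(run: dict) -> list[str]:
    """Names of required fields that are absent or empty in a run record."""
    return [f for f in RECORDED_FIELDS if run.get(f) in (None, "")]


def comparable(run: dict, baseline: dict) -> bool:
    """True only when both records are complete and share the baseline keys."""
    if unrecorded(run) or unrecorded(baseline):
        return False
    return all(run[k] == baseline[k] for k in BASELINE_KEYS)
```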
Use software profiles and hardware profiles together. The software profile describes the component stack. The hardware profile describes the GPU node, rack, network, DPU, storage, cooling, and tenancy assumptions that make the software profile valid.
Scale only after measuring the bottleneck. Add GPU capacity for compute saturation, add model-data acceleration for cold-start and artifact movement, add scheduling controls for placement failures, and add network controls for high east-west transfer pressure.
For each profile, document the starting GPU count, target concurrency, prompt and output shape, model size, cache policy, network requirement, storage requirement, and expected autoscaling trigger. Validate the profile with AIPerf or an equivalent endpoint benchmark before copying it to another model family.
Review sizing profiles after each benchmark run and update the accepted profile only when the measured bottleneck, mitigation, and rollback threshold are recorded.
The inference platform should use the same three observability pillars as the broader NCP software guide: logs, metrics, and traces. In inference, those signals must be correlated with model, endpoint, tenant, GPU, node, scheduler, and network context.
The hot path supports real-time operations: dashboards, alerting, incident response, and service debugging. The cold path supports planning: capacity analysis, regression tracking, cost attribution, validation history, and long-term trend analysis. Keep both paths connected through stable identifiers for tenant, endpoint, model, node, GPU, and request.
Cache hit rate, prefill saturation, decode saturation, model-load state, queue depth, backend version, and routing decision context should be first-class operating signals. These signals are required to distinguish a serving bottleneck from a cache, model-data, scheduler, or network bottleneck.
The break-fix system should detect, triage, remediate, validate, and return GPU infrastructure to service with minimal tenant impact. The design should separate actions that happen while a resource is in the tenant domain from actions that happen after the operator pulls the resource back into the infrastructure domain.
Break-fix policy should be specific about what can happen in place and what requires handing the resource back to the operator. A pod restart, worker replacement, or endpoint scale action may stay inside the tenant domain. GPU reset, firmware remediation, repeated XID errors, network fabric issues, or node reprovisioning may require operator-domain handling.
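A policy like this can be written down as an explicit lookup so that the domain of each remediation is never decided ad hoc during an incident. The table below is a sketch of one possible default. The text says these actions "may" belong to each domain, so the assignments, the event names, and the choice to escalate unknown events to the operator are all assumptions.

```python
from enum import Enum


class Domain(Enum):
    TENANT = "tenant"
    OPERATOR = "operator"


# One possible default mapping, following the examples in the text above.
REMEDIATION_DOMAIN = {
    "pod_restart": Domain.TENANT,
    "worker_replacement": Domain.TENANT,
    "endpoint_scale": Domain.TENANT,
    "gpu_reset": Domain.OPERATOR,
    "firmware_remediation": Domain.OPERATOR,
    "repeated_xid_errors": Domain.OPERATOR,
    "network_fabric_issue": Domain.OPERATOR,
    "node_reprovisioning": Domain.OPERATOR,
}


def remediation_domain(event: str) -> Domain:
    """Return where a remediation runs; unknown events escalate to the operator."""
    return REMEDIATION_DOMAIN.get(event, Domain.OPERATOR)
```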
Node health checks should run at the level where the signal is visible. GPU driver and device checks run on the host, VM, or container that has the device. Kubernetes health checks run through the cluster. Fabric health checks run through the network control plane. Endpoint checks run through serving APIs. The break-fix control plane should correlate these signals instead of treating them as separate incidents.
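Correlation can be as simple as grouping signals that refer to the same node within a time window. The sketch below assumes every signal carries a node identifier, a source layer, and a timestamp in seconds. The field names, the five-minute window, and the function name are hypothetical, and a production system would also correlate on GPU, endpoint, and fabric identifiers.

```python
def correlate(signals: list[dict], window_s: float = 300.0) -> list[dict]:
    """Group signals for the same node into incidents.

    A signal joins the current incident when it is for the same node and
    arrives within `window_s` seconds of that incident's first signal.
    """
    incidents: list[dict] = []
    for sig in sorted(signals, key=lambda s: (s["node"], s["ts"])):
        last = incidents[-1] if incidents else None
        if (last is not None and last["node"] == sig["node"]
                and sig["ts"] - last["start"] <= window_s):
            last["sources"].add(sig["source"])
            last["end"] = sig["ts"]
        else:
            incidents.append({"node": sig["node"], "start": sig["ts"],
                              "end": sig["ts"], "sources": {sig["source"]}})
    return incidents
```

An incident whose `sources` set spans the device, cluster, and endpoint layers points at one underlying fault rather than three separate ones.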
Inference performance depends on native access to GPUs, low-latency network paths, model artifact availability, scheduling locality, and runtime configuration. The operator should validate performance for each accepted service profile rather than assuming one benchmark generalizes to every model.
High-performance inference workers should avoid unnecessary network abstraction on latency-sensitive or high-bandwidth paths. Where the design requires direct NIC access, use SR-IOV, RDMA, GPUDirect RDMA, or equivalent platform mechanisms. Standard CNI networking may still be appropriate for control traffic, user APIs, and lower-volume service calls.
Model load and cache behavior can dominate service readiness. Measure download time, cache-hit time, disk-to-GPU time, peer-to-peer transfer time, first-ready time, and first-token behavior. Use ModelExpress, NIXL, model streaming, and local cache only where they map to an observed bottleneck.
Serving tests should record time to first token, inter-token latency, request latency, output throughput, concurrency, error rate, model size, prompt distribution, output distribution, backend, GPU type, and software versions. Use AIConfigurator to reduce the configuration search space and AIPerf to measure endpoint behavior.
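The per-request latency metrics named above can be derived from the send time and the arrival time of each streamed token. The sketch uses common working definitions, which are assumptions here and may differ from how AIPerf or another benchmark defines them: time to first token is measured from request send, inter-token latency is the mean gap between consecutive tokens, and output throughput is tokens divided by total request latency. `token_latency` is a hypothetical helper.

```python
from statistics import mean


def token_latency(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency metrics for one streamed request (times in seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    total = token_times[-1] - request_sent
    if total <= 0:
        raise ValueError("token timestamps must follow the request timestamp")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_sent,
        "mean_itl_s": mean(gaps) if gaps else 0.0,  # single-token responses have no gap
        "request_latency_s": total,
        "output_tokens_per_s": len(token_times) / total,
    }
```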
Performance acceptance should require a reproducible command or workload definition, captured environment metadata, stored benchmark output, and a pass or fail threshold. Reject comparisons that omit backend version, model artifact revision, network mode, or cache state.
Isolation and security must be designed across the infrastructure, Kubernetes, AI platform, model artifact, telemetry, and user-access layers. The goal is to protect tenants from each other, protect platform services from tenant workloads, and keep operator actions auditable.
Security review should cover identity boundaries between platform APIs, model-serving services, operators, and cluster agents. Secrets must be scoped by namespace and service account. Model artifact sources should have access controls, provenance checks, and air-gap behavior when needed. Network policy should separate API ingress, control plane, metadata stores, model-data movement, and telemetry. Release reviews should record component revisions, package metadata, container image origins, SBOM availability, and validation outputs.
An NCP inference platform should define tenant boundaries at the infrastructure, Kubernetes, platform API, model artifact, telemetry, and network layers. Bare metal gives the strongest node-level isolation. Virtual machines provide a strong cloud abstraction and can improve tenant lifecycle handling. Kubernetes namespaces are useful but should not be the only isolation boundary for high-value multi-tenant GPU services.
For managed Kubernetes and AI platform services, per-tenant control planes or strongly isolated control-plane partitions reduce cross-tenant blast radius. Shared services such as registries, identity services, metadata services, and observability backends must enforce tenant-aware access.
Tenancy should be visible to the platform API, scheduler, router, model-data services, tracing, and telemetry. Validate tenant-aware routing, service-account scope, secret access, model artifact access, trace partitioning, quota behavior, and noisy-neighbor behavior under load.
Operators should maintain trust in the firmware and software running on GPU nodes, DPUs, BMCs, and control-plane systems. Secure boot establishes a chain of signed software. Measured boot records what was loaded. Remote attestation lets a verifier compare measurements with accepted values before a node is handed to tenant workloads.
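The verifier's final step, comparing reported measurements with accepted values, can be sketched as follows. This shows only the comparison. A real attestation flow also validates a signed quote, a freshness nonce, and the identity of the attesting device, none of which are modeled here. The component names and `verify_measurements` are hypothetical, and measurements are assumed to be hex-encoded digest strings.

```python
import hmac


def verify_measurements(reported: dict[str, str],
                        accepted: dict[str, str]) -> list[str]:
    """Return the components whose measurement is missing or differs.

    An empty list means every accepted component matched and the node can
    proceed to the next hand-off check.
    """
    mismatched = []
    for component, expected in accepted.items():
        actual = reported.get(component)
        # compare_digest avoids leaking where two digests first differ.
        if actual is None or not hmac.compare_digest(actual, expected):
            mismatched.append(component)
    return mismatched
```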
The operator owns infrastructure, physical security, node lifecycle, GPU and network enablement, platform control planes, and tenant isolation. The tenant owns its users, model artifacts, endpoint policy, application configuration, and workload-level security. End users own application credentials, prompt and data handling, and any application logic they deploy on top of the service.
Run the stack as a versioned platform. Track operator versions, CRDs, inference runtimes, model-data services, scheduler configuration, and validation tooling in the same release review. Use health monitoring and remediation workflows for GPU and NVSwitch faults. Keep benchmark baselines for each accepted configuration, and rerun them after submodule updates, driver changes, Kubernetes upgrades, or network changes.
Release reviews should include top model configurations for each accepted backend, known backend gaps, runtime-specific regressions, and the rollback decision for each service profile. The operations process should treat unsupported feature gaps as explicit disposition items rather than rediscovering them during incidents.