Solution Overview

Application Layer

OSMO is the workflow orchestration platform used to run multi-stage Physical AI pipelines.

Prometheus (via kube-prometheus-stack) provides metrics collection and monitoring across the cluster. See Table 1 for version details.

OSMO: Uses persistent volumes backed by Longhorn for workflow state and dataset storage.

Kubernetes (Longhorn v1.7.2): Longhorn provides distributed block storage for the cluster.

Features:

Table 1: Infrastructure Software Support Matrix

Component	Version	Notes
Host Operating System	Ubuntu 24.04 LTS Server
Linux Kernel (Control Planes)	6.8.0-84-generic
Linux Kernel (HGX B200)	6.8.0-85-generic
Linux Kernel (RTX PRO Servers)	6.8.0-85-generic
Kubespray	v2.28.0	Used for Kubernetes deployment
Container Runtime	Containerd 2.0.5
Kubernetes	v1.32.5	Deployed using Kubespray
NVIDIA Driver	580.82.07
CUDA Toolkit	13.0	Included with driver 580.82.07
NVIDIA GPU Operator	v25.3.4	Deployed via Helm chart from NGC
NVIDIA Network Operator	v25.7.0
Longhorn	v1.7.2	Distributed block storage; not used for performance-critical operations
Prometheus (kube-prometheus-stack)	82.4.3	Installed via Helm
LoadBalancer (MetalLB)	0.15.3	Used for LoadBalancer service type support
Firmware	nvfwupd-v2.0.8, FW 25.09.1
VBIOS	97.00.6E.00.07	Production support release
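
A support matrix like Table 1 is most useful when a deployment can be checked against it. The following is a minimal sketch of such a check, assuming the installed versions have already been collected into a dictionary by some other means. The function name `find_mismatches` and the sample inventory are hypothetical; only the expected versions come from Table 1.

```python
# Expected versions copied from Table 1 (a subset, for brevity).
EXPECTED = {
    "Kubernetes": "v1.32.5",
    "Kubespray": "v2.28.0",
    "NVIDIA Driver": "580.82.07",
    "NVIDIA GPU Operator": "v25.3.4",
    "NVIDIA Network Operator": "v25.7.0",
    "Longhorn": "v1.7.2",
}


def find_mismatches(expected, observed):
    """Return {component: (expected, observed)} for every component that differs.

    A component missing from `observed` is reported with None as its
    observed value.
    """
    mismatches = {}
    for component, want in expected.items():
        got = observed.get(component)
        if got != want:
            mismatches[component] = (want, got)
    return mismatches


if __name__ == "__main__":
    # Hypothetical inventory: identical to Table 1 except for Longhorn.
    observed = dict(EXPECTED, Longhorn="v1.7.1")
    for component, (want, got) in find_mismatches(EXPECTED, observed).items():
        print(f"{component}: expected {want}, found {got}")
```

Exact string comparison is deliberate here: the matrix lists specific validated versions rather than ranges, so any difference is worth surfacing for review.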

Kubernetes version: v1.32.5, deployed using Kubespray v2.28.0

High Availability (HA) configuration:
- 3 Control Plane nodes (k8s-1-cp1, k8s-1-cp2, k8s-1-cp3)
- Stacked etcd topology (etcd running on control plane nodes)
- Kube-VIP for control plane load balancing
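
With a stacked etcd topology, each control plane node also hosts an etcd member, so losing a node removes one voter from the etcd cluster. etcd stays writable only while a majority of members is reachable. The short sketch below works through that arithmetic for the three control plane nodes listed above; the function names are illustrative.

```python
def etcd_quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can fail while the cluster keeps quorum."""
    return members - etcd_quorum(members)


if __name__ == "__main__":
    control_planes = ["k8s-1-cp1", "k8s-1-cp2", "k8s-1-cp3"]
    n = len(control_planes)
    print(f"{n} members: quorum={etcd_quorum(n)}, "
          f"tolerates {failure_tolerance(n)} failure(s)")
    # prints: 3 members: quorum=2, tolerates 1 failure(s)
```

In other words, this control plane can lose any one of its three nodes and keep operating, but not two.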

Worker nodes:

NVIDIA 2-8-5-200 (2 CPUs for system management, 8 PCIe GPUs for acceleration, 5 network adapters/DPUs for high-speed connectivity, 200 Gbps dedicated network bandwidth per GPU for distributed workloads). In the test bed: 2x RTX PRO Servers (rtxprosrv1, rtxprosrv2) each equipped with 8x RTX PRO 6000 Blackwell Server Edition GPUs.
NVIDIA 2-8-9-400 (2 CPUs for system management, 8 NVLink-connected HGX GPUs, 9 network adapters/DPUs for high-speed connectivity, 400 Gbps network bandwidth per GPU for distributed workloads). In the test bed: 1x HGX B200 (b200-1) with 8x B200 GPUs.
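
The designations above follow a CPUs-GPUs-adapters-bandwidth pattern. As a sketch of how to read them, the snippet below splits a designation into its four fields and derives the nominal aggregate GPU network bandwidth per node and the total GPU count in the test bed. The `NodeConfig` class and `parse_node_config` function are hypothetical helpers, and the aggregate figures are simple products of the stated numbers, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    cpus: int
    gpus: int
    network_adapters: int
    gbps_per_gpu: int

    @property
    def total_gpu_bandwidth_gbps(self) -> int:
        """Nominal aggregate network bandwidth across all GPUs in the node."""
        return self.gpus * self.gbps_per_gpu


def parse_node_config(designation: str) -> NodeConfig:
    """Parse a designation such as '2-8-5-200' into its four fields."""
    parts = designation.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected four dash-separated fields: {designation!r}")
    cpus, gpus, adapters, gbps = (int(p) for p in parts)
    return NodeConfig(cpus, gpus, adapters, gbps)


if __name__ == "__main__":
    # Worker nodes in the test bed, keyed by hostname.
    test_bed = {
        "rtxprosrv1": "2-8-5-200",
        "rtxprosrv2": "2-8-5-200",
        "b200-1": "2-8-9-400",
    }
    configs = {host: parse_node_config(d) for host, d in test_bed.items()}
    for host, cfg in configs.items():
        print(f"{host}: {cfg.gpus} GPUs, "
              f"{cfg.total_gpu_bandwidth_gbps} Gbps nominal aggregate")
    print("total GPUs:", sum(cfg.gpus for cfg in configs.values()))
```

For this test bed the totals come to 24 GPUs, with 1600 Gbps nominal per RTX PRO Server and 3200 Gbps for the HGX B200 node.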

Special configuration for RTX PRO 6000 Blackwell Server Edition GPUs:

Kernel module parameter: uvm_disable_hmm=1
Applied via ConfigMap to work around a CUDA validation issue on RTX PRO 6000 Blackwell Server Edition GPUs
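
After the parameter has been applied, it can be useful to confirm on a worker node that the loaded module actually picked it up. The sketch below reads the value from sysfs. It assumes the module appears as `nvidia_uvm` under `/sys/module` and that `uvm_disable_hmm` is exported there as a readable boolean parameter; both are assumptions to verify against the installed driver, and the helper name `read_bool_module_param` is hypothetical.

```python
from pathlib import Path


def read_bool_module_param(module, param, sysfs_root="/sys/module"):
    """Return True/False for a boolean kernel module parameter.

    Returns None if the parameter file is missing, unreadable, or holds a
    value that is not a recognised boolean.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        raw = path.read_text().strip()
    except OSError:
        return None
    if raw in ("Y", "y", "1"):
        return True
    if raw in ("N", "n", "0"):
        return False
    return None


if __name__ == "__main__":
    # Assumed module and parameter names; see the note above.
    state = read_bool_module_param("nvidia_uvm", "uvm_disable_hmm")
    if state is None:
        print("uvm_disable_hmm: not readable (module not loaded or not exported)")
    else:
        print(f"uvm_disable_hmm: {'enabled' if state else 'disabled'}")
```

The `sysfs_root` argument exists only so the function can be exercised against a temporary directory instead of the live `/sys/module` tree.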

CUDA 13.0 (included with driver 580.82.07)

NVIDIA GPU Operator version: v25.3.4, deployed via Helm chart from NGC

Components deployed:

NVIDIA Network Operator version: v25.7.0

Configuration approach:

Network architecture:

North-South: Standard network architecture
East-West: Simplified setup in which all rails from the RTX PRO Servers connect to the same switch
Advanced Spectrum-X features were skipped because of the lab topology

Firmware updated: nvfwupd-v2.0.8, FW 25.09.1
VBIOS: 97.00.6E.00.07 (production support release)