Overview

This document is an end-to-end reference architecture and deployment guide for building accelerated infrastructure for Physical AI development. It details the deployment of a highly available Kubernetes cluster with Kubespray on bare-metal NVIDIA hardware, specifically HGX B200 and RTX Pro Servers equipped with Blackwell GPUs.

The deployment culminates in the orchestration of a multi-stage Physical AI workflow, with Run:ai handling resource management and OSMO handling workflow execution. The workflow focuses on preparing data for fine-tuning a Vision-Language-Action (VLA) model, specifically GR00T-N1.5. The pipeline comprises six steps: synthetic data generation via MimicGen, conversion of the HDF5 data to MP4, photorealistic augmentation using Cosmos Transfer 2.5, reverse conversion to augmented HDF5, conversion to the LeRobot training format, and finally GR00T-N1.5 fine-tuning. With this guide, users can replicate the complete environment, from initial operating system provisioning to executing and monitoring complex AI blueprints.
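To make the ordering of the six steps concrete, the pipeline can be pictured as a linear chain in which each stage consumes the artifact produced by the one before it. The following is a minimal Python sketch of that idea only. It is not the OSMO workflow specification or any Run:ai API. The stage names, the format labels, and the `seed-demonstrations` input of the first stage are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline step: the artifact it consumes and the one it produces."""
    name: str
    consumes: str
    produces: str


# Hypothetical stage names and format labels; the ordering follows the
# six steps listed in this overview. The first stage's input is an assumption.
PIPELINE = [
    Stage("mimicgen-generate", "seed-demonstrations", "hdf5"),
    Stage("hdf5-to-mp4", "hdf5", "mp4"),
    Stage("cosmos-transfer-augment", "mp4", "augmented-mp4"),
    Stage("mp4-to-hdf5", "augmented-mp4", "augmented-hdf5"),
    Stage("hdf5-to-lerobot", "augmented-hdf5", "lerobot-dataset"),
    Stage("groot-finetune", "lerobot-dataset", "finetuned-checkpoint"),
]


def validate_chain(stages):
    """Check that each stage's output feeds the next stage's input.

    Returns the stage names in execution order, or raises ValueError
    at the first mismatch.
    """
    for prev, nxt in zip(stages, stages[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(
                f"{prev.name} produces {prev.produces!r}, "
                f"but {nxt.name} expects {nxt.consumes!r}"
            )
    return [stage.name for stage in stages]


if __name__ == "__main__":
    for index, name in enumerate(validate_chain(PIPELINE), start=1):
        print(f"{index}. {name}")
```

Modeling the handoffs explicitly makes it easy to see that a stage cannot be reordered or skipped without breaking the input expected by its successor, which is the property a workflow engine enforces when it schedules the real tasks.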