Hardware Infrastructure

The hardware design for the AI Factory prioritizes scalability and elasticity, facilitating horizontal scaling of compute with NVIDIA GPUs, high-speed networking with NVIDIA Spectrum-X, and seamless orchestration using enterprise-ready Kubernetes platforms. Following NVIDIA Enterprise Reference Architecture (Enterprise RA) guidance, this foundation enables organizations to deploy from 4 up to 32 nodes (scaling to 256 GPUs or more) with NVIDIA-Certified compute, resilient storage, and optimal network topologies tailored for enterprise-class demands. Each Enterprise RA includes deployment guides, cluster characterization, automated provisioning, and best-practice sizing for a diverse set of AI workloads, including pre-training, post-training, real-time inference, agent-based analytics, and HPC. Designed for on-premises, single-tenant, Ethernet-based environments, these architectures provide a robust and flexible foundation for modern AI, supporting both high-assurance government use cases and dynamic commercial applications.
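
As a minimal sketch of how such a cluster is consumed once provisioned, the example below schedules a GPU workload through the Kubernetes Python client. The namespace, pod name, and container image are illustrative assumptions, and the `nvidia.com/gpu` resource presumes the NVIDIA device plugin or GPU Operator is installed on the cluster.

```python
# Hedged sketch: request an 8-GPU node's worth of accelerators via the
# Kubernetes Python client. Namespace, pod name, and image tag are
# illustrative assumptions, not part of the reference architecture.
from kubernetes import client, config

config.load_kube_config()  # reads the local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="triton-inference", namespace="ai-factory"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="triton",
                image="nvcr.io/nvidia/tritonserver:24.08-py3",  # illustrative tag
                resources=client.V1ResourceRequirements(
                    # The NVIDIA device plugin exposes GPUs as a schedulable
                    # resource; request all eight GPUs of one node.
                    limits={"nvidia.com/gpu": "8"},
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ai-factory", body=pod)
```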

Enterprise AI — especially for complex agentic systems as deployed across public sector agencies, national security enclaves, and data-sensitive mission environments — demands computational scale and operational agility beyond traditional data center limits. Such systems may execute a mix of sequential workflows, logical reasoning, distributed data analysis, and near-instant model inference. To meet these needs, accelerated computing platforms combine powerful, workload-optimized GPUs and CPUs, high-performance east-west networking, and specialized software stacks. This approach dramatically improves processing speed and energy efficiency, with the Enterprise Reference Architectures orchestrating these resources to ensure repeatable outcomes, adaptable management, and compliance-aligned deployment for diverse and evolving government workloads.

Two flagship NVIDIA accelerated computing platforms validated within these reference architectures offer distinct mission profiles and deployment benefits for regulated environments:

  • NVIDIA RTX PRO™ Servers

    Well-suited for inference-heavy workloads, a wide range of use cases including visual computing and HPC, air-gapped operations, and sites with constrained power and cooling. RTX PRO Servers deliver resilient, distributable AI capacity in power-efficient nodes configurable with two to eight NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, supporting vGPU partitioning and facilitating simpler accreditation boundaries. This approach enables agencies to maintain strict security postures while deploying robust, scalable AI where physical and operational separation is mandatory, for example in air-gapped operations or classified enclaves. RTX PRO 6000 Blackwell is ideal for organizations looking to maximize compute in constrained footprints while supporting mission flexibility and simplified ATO (Authority to Operate) processes.

  • NVIDIA HGX™ B200 and B300

    Designed for centralized, large-scale training, fine-tuning, and elastic resource pools, the HGX B200 and B300 platforms offer eight Blackwell GPUs tightly integrated via NVLink for distributed training and long-context inference. This architecture is optimal for building and iterating on new models in high-assurance, centralized environments, after which models can be exported to distributed RTX PRO 6000 nodes for production inference. A minimal distributed-training sketch for such an eight-GPU node follows this list.
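
As a hedged illustration of what "tightly integrated via NVLink" buys in practice, the sketch below initializes data-parallel training across a node's eight GPUs with PyTorch's NCCL backend, which routes its collectives (all-reduce, all-gather) over NVLink where available. The placeholder model and the `torchrun` launch assumptions are ours, not part of the reference architecture.

```python
# Minimal data-parallel training sketch for an 8-GPU NVLink node, assuming
# launch via `torchrun --nproc_per_node=8 train.py`. Model and data are
# illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")  # NCCL uses NVLink when present
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(10):  # placeholder training loop
        x = torch.randn(32, 4096, device=local_rank)
        loss = ddp_model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs via NCCL
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```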

Selected Blackwell products with Confidential Computing enhance privacy and security for sensitive AI workloads, delivering industry-leading performance with nearly identical throughput to unencrypted modes. These platforms extend hardware-based Trusted Execution Environments to GPUs, providing data and code confidentiality and integrity through encrypted CPU-GPU and GPU-GPU transfers. The architecture protects against unauthorized access from system administrators and malicious insiders by isolating workloads within confidential VMs, ensuring only authorized personnel can access code and data. With hardware root of trust, secure attestation, and virtualization-based protection, organizations can securely train, fine-tune, and deploy AI models across multi-GPU configurations while maintaining strict compliance with federal security requirements.
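
The operational pattern here is attest-then-release: secrets are provisioned to a GPU only after its hardware root of trust produces evidence that a verifier accepts. The sketch below shows that gate; the helper functions are hypothetical placeholders for an agency's chosen attestation tooling, not a real NVIDIA API.

```python
# Hedged sketch of a confidential-computing attestation gate: workload
# secrets are released only after GPU evidence is verified. Both helper
# functions below are hypothetical placeholders, not a real NVIDIA API.

def fetch_gpu_evidence(gpu_index: int) -> bytes:
    """Hypothetical: collect signed attestation evidence from the GPU's
    hardware root of trust (e.g., via an attestation SDK)."""
    raise NotImplementedError

def verify_with_relying_party(evidence: bytes) -> bool:
    """Hypothetical: submit evidence to a verifier that checks signatures,
    firmware measurements, and revocation status."""
    raise NotImplementedError

def release_secrets_if_trusted(gpu_index: int = 0) -> None:
    evidence = fetch_gpu_evidence(gpu_index)
    if not verify_with_relying_party(evidence):
        # Fail closed: never provision keys or data to an unverified GPU.
        raise RuntimeError("GPU attestation failed; withholding secrets")
    # ...decrypt model weights / provision data keys only past this point...
```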

Complementing these confidential computing capabilities, all NVIDIA-Certified systems undergo testing for Trusted Platform Module (TPM) 2.0 compliance. TPM provides an additional hardware-based security layer that enables platform integrity verification, disk encryption, and system identification and attestation, capabilities mandated by DoD STIG standards. This standard ensures that certified systems meet federal security requirements from the firmware level up.
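
As one concrete illustration of platform integrity verification, the sketch below reads the TPM's Platform Configuration Registers (PCRs), which accumulate boot-time firmware measurements, by shelling out to `tpm2_pcrread` from the open-source tpm2-tools suite. The tool choice and the baseline-comparison step are assumptions for illustration, not a mandated workflow.

```python
# Hedged sketch: read TPM 2.0 PCR values for platform integrity checks.
# Assumes the open-source tpm2-tools package is installed; the comparison
# policy suggested in the comments is an illustrative assumption.
import subprocess

def read_pcrs(bank: str = "sha256", pcrs: str = "0,1,2,3,7") -> str:
    # PCRs 0-7 hold firmware and boot-loader measurements.
    result = subprocess.run(
        ["tpm2_pcrread", f"{bank}:{pcrs}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    current = read_pcrs()
    print(current)
    # In practice, compare `current` against a recorded known-good baseline
    # and refuse to join the cluster (or release disk-encryption keys) on drift.
```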

Building on this foundation, NVIDIA Enterprise RAs include flexible hardware configurations tailored to workload and connectivity needs, expressed in a CPUs-GPUs-NICs-bandwidth shorthand (a small parsing sketch follows the lists below). For RTX PRO 6000 and HGX B200 platforms, two prominent designs are:

  • 2-8-5-200: 2 CPUs, 8 GPUs, 5 Network Adapters, and 200 Gb/s east/west average bandwidth

  • 2-8-5-400: 2 CPUs, 8 GPUs, 5 Network Adapters, and 400 Gb/s east/west average bandwidth using ConnectX-8 as a PCIe Switch

For the HGX B300 platform, the optimized configuration is:

  • 2-8-9-800: 2 CPUs, 8 GPUs, 9 Network Adapters, and 800 Gb/s east/west average bandwidth (available in Single Plane and Dual Plane variants)
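
The shorthand is regular enough to treat as data. Below is a small, illustrative parser for the CPUs-GPUs-NICs-bandwidth convention described above; the class and field names are our own, not NVIDIA nomenclature.

```python
# Illustrative parser for the Enterprise RA configuration shorthand
# (CPUs-GPUs-NICs-bandwidth), e.g. "2-8-5-400". Names are our own,
# not official NVIDIA nomenclature.
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeConfig:
    cpus: int               # CPUs per node
    gpus: int               # GPUs per node
    nics: int               # network adapters per node
    ew_bandwidth_gbps: int  # average east-west bandwidth, Gb/s

    @classmethod
    def parse(cls, shorthand: str) -> "NodeConfig":
        cpus, gpus, nics, bw = (int(part) for part in shorthand.split("-"))
        return cls(cpus, gpus, nics, bw)

configs = [NodeConfig.parse(s) for s in ("2-8-5-200", "2-8-5-400", "2-8-9-800")]
for c in configs:
    print(c)
```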

When selecting the optimal GPU for a government AI Factory, agencies should consider:

  • Inference performance: Efficient operation across FP16, INT8, and newer formats such as FP4 and FP6 is essential for low-latency inference and high throughput in reasoning AI and agentic systems.

  • GPU memory capacity: Large VRAM enables support for the most advanced language models and retrieval-augmented generation (RAG) workloads (a rough weight-sizing sketch follows this list).

  • Scalability and interconnects: Capabilities such as NVLink drive parallelism and fast data movement in large model or hybrid deployments, ensuring seamless scale as mission demands grow.
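
To ground the memory-capacity criterion: model weights alone require roughly parameter count times bytes per parameter, so a 70B-parameter model needs about 140 GB in FP16 but only about 35 GB in FP4, before accounting for KV cache and activations. The sketch below encodes that arithmetic; the model sizes and format list are illustrative.

```python
# Back-of-the-envelope GPU memory sizing for model weights only (KV cache,
# activations, and framework overhead add more). Model sizes are illustrative.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "FP6": 0.75, "FP4": 0.5}

def weights_gb(num_params: float, fmt: str) -> float:
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for params in (8e9, 70e9):  # e.g., 8B- and 70B-parameter models
    for fmt in ("FP16", "FP8", "FP4"):
        print(f"{params / 1e9:.0f}B @ {fmt}: ~{weights_gb(params, fmt):.0f} GB")
```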

NVIDIA Enterprise RAs, by providing recommended hardware configurations for accelerated computing and network connectivity, empower government agencies to harness next-generation AI for reliable, secure, and future-ready operations. Solutions based on NVIDIA Enterprise RAs are delivered to market by leading system partners such as Cisco, Dell Technologies, HPE, Lenovo, and Supermicro. These solutions have passed the NVIDIA Design Review Board, achieving endorsement across one or more categories, including Infrastructure, Networking Logic, and Software, and provide proven design frameworks for scalable AI factory deployments.