Introduction#
NVIDIA is helping enterprises build AI factories that are cost-effective, scalable, and high-performing, equipping them for the next industrial revolution. AI factory solutions are becoming available through NVIDIA’s global ecosystem of enterprise partners with NVIDIA-Certified Systems, NVIDIA-Certified Storage, and NVIDIA Networking. These partners offer top-tier hardware, software, and data center expertise to mitigate risk and improve ROI on AI projects. This white paper presents the necessary components, including integrations from ecosystem partners, automation tools, and deployment strategies. Enterprise partners can use this design to integrate accelerated computing, high-performance networking, and AI software to build single-tenant, enterprise-ready AI factories.
Terms and Definitions#
Control Plane | The container orchestration layer that exposes the APIs and interfaces to define, deploy, and manage the lifecycle of containers.
Worker Host | A bare-metal server that provides the physical foundation for the control plane and worker nodes.
Worker Node | A compute resource on a physical host; the resource that AI developers use to execute their workloads.
Cluster | A set of worker nodes that run containerized applications. Every cluster has at least one worker node.
AI Agent | A system of LLMs that work together to reason about a problem with data and act on it.
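The AI Agent entry above describes cooperating models that reason and then act. A minimal sketch of that pattern in Python, with the model calls stubbed out — the function names and data here are illustrative assumptions, not part of the reference architecture; in practice `plan` and `act` would call LLM inference endpoints:

```python
# Minimal sketch of the agent pattern: a planner model decomposes a task,
# a worker model acts on each step against the available data, and the
# results are collected. Both model calls are stubbed out for illustration.

def plan(task: str) -> list[str]:
    # Stub planner: return the sub-steps an LLM planner would generate.
    return [f"analyze: {task}", f"act: {task}"]

def act(step: str, data: dict) -> str:
    # Stub worker: reason about one step using the available data.
    return f"{step} -> used {len(data)} records"

def run_agent(task: str, data: dict) -> list[str]:
    # Orchestrate: plan first, then execute each step against the data.
    return [act(step, data) for step in plan(task)]

results = run_agent("summarize sales", {"q1": 100, "q2": 120})
print(results)
```

The key design point is the separation of reasoning (`plan`) from action (`act`), which is what distinguishes an agent from a single prompt-response call.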
Scope#
This white paper provides guidance for building AI factories. It describes architectural best practices that leverage ecosystem partners along with NVIDIA hardware and software as a starting point for building AI factories, and presents a broad range of ecosystem partners with enterprise commercial offerings.
User Personas#
The following user personas will interact with, administer, or use a system implemented by this design guide.
Ecosystem Partners | A third-party hardware vendor, software vendor, or system integrator that is NVIDIA-certified and integrated into the AI factory reference architecture.
OEMs | Server OEM (Original Equipment Manufacturer) companies that design, manufacture, and sell server hardware and components.
ISVs | Independent Software Vendors (ISVs) that develop and sell software products.
AI Developer | A person who uses the system to implement AI-driven features in applications, integrating LLMs and writing code to deploy AI functionality in software.
MLOps | A person who uses the system to increase automation and improve the quality of the AI/ML lifecycle from development through deployment and monitoring.
IT Admin | A person who helps design, procure, and manage the underlying infrastructure, including servers, operating systems, and storage.
Network Administrator | A person who manages the network infrastructure, including routers, switches, firewalls, and security protocols.
Platform Requirements#
- Data centers with sufficient power, cooling, space, and network connectivity for high-density GPU systems. Refer to the NVIDIA HGX B200 8-GPU and NVIDIA RTX™ Pro Server Edition sections of the NVIDIA Enterprise Reference Architecture.
- Skilled personnel or a trained system integrator for:
  - Base hardware installation and software platform provisioning
  - Support for ongoing operation and management