Introducing the NVIDIA Enterprise Reference Architectures#

NVIDIA developed Enterprise Reference Architectures (Enterprise RAs) to provide clear and consolidated recommendations for NVIDIA system partners and our joint enterprise customers building AI Factories. By taking the same technical components used in the supercomputing world and packaging them with design recommendations based on decades of experience, NVIDIA aims to eliminate the burden of building these systems from scratch, offering a streamlined approach to flexible and cost-effective configurations that takes the guesswork and risk out of deployment. This ensures customers have the best experience in terms of performance, utilization, uptime, total cost of ownership (TCO), and supportability. Ultimately, this helps our partners and joint customers achieve value sooner and maximize the return on their investments.

NVIDIA architects have gained extensive knowledge from many hours of testing, determining best practices for configuring systems to maximize performance and establishing a baseline of performance standards. NVIDIA Enterprise RAs capture these shared learnings as common design patterns, helping customers avoid the pitfalls NVIDIA experts have encountered and providing guidance on delivering well-balanced systems in which bottlenecks caused by individual components are minimized. This enables partners and enterprise customers to confidently deliver AI solutions faster, allowing them to focus on running their business rather than fighting deployments. Whether you choose to implement a full-fledged data center using our guidelines or adapt the node configurations with your own networking, these NVIDIA Enterprise RAs provide an invaluable starting point.

Example of shared learning: Transceivers and cabling are paramount in large-scale implementations, and getting these wrong can have a major impact on delivery times and customer experience. An NVIDIA partner deviated from our design recommendations despite our advice to follow the reference architecture. To save costs, they opted for copper cables instead of the recommended transceivers. While copper cables are suitable for many installations, our engineers had previously encountered a heat issue with this configuration at scale. Unfortunately, the partner did not heed our warning and subsequently faced the same heat wall, leading to unnecessary struggles and a longer, more expensive deployment for the customer. Sharing these types of learnings and insights is crucial so that customers understand the potential impact of their design choices as they scale. This helps our partners deliver reliable solutions faster and ensures customers are happy and successful.

Methodology for Bringing Reference Architectures to Market#

NVIDIA has a highly structured approach to introducing reference architectures for new technologies, such as NVIDIA Hopper GPUs, Blackwell GPUs, Grace CPUs, the Spectrum-X networking platform, and BlueField architecture-based offerings. For each new technology, NVIDIA provides configuration guides to assist partners in designing, building, and deploying optimized system configurations. Partners can then submit these systems for certification through the NVIDIA-Certified Systems program. This program involves rigorous testing, including thermal analysis, mechanical stress tests, power consumption evaluations, and signal integrity assessments, to ensure the components function optimally within the server design. Furthermore, an NVIDIA-Certified server must pass a comprehensive suite of performance tests covering various workload categories, networking capabilities, security features, and management functionalities across a wide range of applications and use cases.

NVIDIA offers two key reference architecture programs that leverage NVIDIA-Certified servers: NVIDIA Enterprise Reference Architectures and the NVIDIA Cloud Partner (NCP) Reference Architecture.

NVIDIA Enterprise Reference Architecture#

NVIDIA Enterprise Reference Architectures are tailored for enterprise-class deployments, ranging from 32 to 256 GPUs. Depending on the base technology, they include configurations for 4 up to 32 nodes, complete with the appropriate networking topology, switching, and allocations for storage and control plane nodes. Each is derived from the NCP Reference Architecture but right-sized for enterprise-scale deployments, and provides deployment guides, cluster characterization, provisioning automation using BCMe, and sizing guides for common enterprise AI implementations. NVIDIA Enterprise RAs are designed to support a diverse range of workloads, including fine-tuning, Retrieval-Augmented Generation (RAG), model training, inference, and small-scale High-Performance Computing (HPC) tasks. These designs provide a versatile foundation for enterprise AI with a focus on single-tenant, Ethernet-based environments.

Each reference architecture is designed around an NVIDIA-Certified server that follows a prescriptive design pattern, called a Reference Configuration, to ensure optimal performance when deployed in a cluster. Reference Configurations standardize the description of compute nodes based on their CPU, GPU, network, and bandwidth configurations. This C-G-N-B nomenclature simplifies system selection by clearly defining compute power, networking capabilities, and bandwidth performance: the four numbers (for example, 2-8-5-200) denote the number of CPU sockets (C), the number of GPUs (G), the number of network adapters or NICs (N), and the average East-West network bandwidth per GPU in GbE (B), respectively.

With GPU and networking advancements on the horizon, these architectures ensure scalability and future-proofing for enterprise applications.

Table 1. Examples of Reference Configuration Node and Networking Patterns

| C-G-N-B Configuration | Description |
|---|---|
| 2-8-9-400 | 2 CPUs, 8 GPUs, 9 NICs (1 North/South, 8 East/West), 400 GbE per GPU |
| 2-2-3-400 | 2 CPUs, 2 GPUs, 3 NICs (1 North/South, 2 East/West), 400 GbE per GPU |
| 2-8-5-200 | 2 CPUs, 8 GPUs, 5 NICs (1 North/South, 4 East/West), 200 GbE per GPU |
| 2-4-3-200 | 2 CPUs, 4 GPUs, 3 NICs (1 North/South, 2 East/West), 200 GbE per GPU |
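To make the C-G-N-B nomenclature concrete, here is a minimal sketch of a parser for these labels. The `parse_cgnb` helper and the `ReferenceConfig` type are hypothetical names introduced for illustration, not part of any NVIDIA tooling; the assumption that one NIC is North/South and the remainder are East/West follows the patterns shown in Table 1.

```python
from dataclasses import dataclass


@dataclass
class ReferenceConfig:
    """A node description decoded from a C-G-N-B label (hypothetical type)."""
    cpus: int               # C: number of CPU sockets
    gpus: int               # G: number of GPUs
    nics: int               # N: total network adapters (NICs)
    ew_bandwidth_gbe: int   # B: average East-West bandwidth per GPU, in GbE

    @property
    def east_west_nics(self) -> int:
        # Assumes one North/South NIC, as in the Table 1 examples.
        return self.nics - 1


def parse_cgnb(label: str) -> ReferenceConfig:
    """Parse a C-G-N-B label such as '2-8-5-200' into its four fields."""
    parts = label.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected four '-'-separated fields, got {label!r}")
    c, g, n, b = (int(p) for p in parts)
    return ReferenceConfig(cpus=c, gpus=g, nics=n, ew_bandwidth_gbe=b)
```

For example, `parse_cgnb("2-8-5-200")` yields a node with 2 CPU sockets, 8 GPUs, 5 NICs (4 of them East/West under the one-North/South assumption), and 200 GbE of East-West bandwidth per GPU, matching the third row of Table 1.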

NVIDIA Cloud Partner (NCP) Reference Architecture#

The existing NCP Reference Architecture is designed for larger-scale foundational model training, starting from 128 nodes and scaling up to over 16,000 GPUs. These architectures are derived from larger-scale HPC superclusters, which can range up to 100,000 GPUs. The NCP Reference Architecture predominantly relies on NVIDIA's professional services for deployment. It is intended for use cases such as large language model foundational training and GPU-as-a-service, where a large number of GPUs are made available for customers to rent. This architecture supports both single-tenant and multi-tenant environments and uses both InfiniBand and Ethernet. The design is more rigid to ensure that deployments by NVIDIA's professional services do not require additional engineering time for reconfiguration at customer sites. By adhering to the reference architecture, NVIDIA can ensure a consistent and efficient deployment process.

These two reference architecture programs provide a robust starting point for NVIDIA partners and our joint customers, helping them avoid common pitfalls and achieve faster, more efficient AI deployments.

Note

The rest of this paper focuses on the NVIDIA Enterprise RA program. For larger-scale solutions, please refer to the NCP program.