Bare Metal Deployment Guide (0.1.0)

Prerequisites

GDS requires specific server configurations, file systems, and software. This section outlines the prerequisites for using GDS with NVIDIA AI Enterprise.

NVIDIA AI Enterprise leverages NVIDIA-Certified Servers to deliver workloads. A complete list of NVIDIA AI Enterprise Compatible servers can be found in the Compatible Systems List. A subset of these servers can be used with GPUDirect Storage. OEMs, CSPs, and ODMs can leverage the NVIDIA GPUDirect Storage Design Guide to design their servers to take advantage of GPUDirect Storage. The Design Guide also helps application developers understand where GPUDirect Storage can bring value to application performance.

Aside from the GPU selection, GDS also depends on a server’s PCI Express subsystem layout to operate. One of the key performance elements of this technology is the ability to shorten the data path between a GPU and a network interface card (NIC), which requires a PCI Express switch to be present in the topology between the GPU and NIC. Hence, the systems that support GDS can be broken down into two groups: servers built with a PCIe switch and servers equipped with NVIDIA converged cards.

Below is an example of a system equipped with a PCI Express switch – a topic that will be covered in detail later – demonstrating the data path between a GPU and a network interface card (NIC) without GDS and then with GDS:

Without GDS:

[Figure gds-01.png: data path from GPU to NIC without GDS]


The data is first transferred from the GPU to the CUDA driver buffer in system memory (green arrow “1”), then copied to the network card buffer, also in system memory (yellow arrow “2”), and only then reaches the NIC (blue arrow “3”).

With GDS:

[Figure gds-02.png: data path from GPU to NIC with GDS]


The dotted line shows the eliminated path to and from the system memory, which is no longer needed now that the NIC can communicate with the GPU memory directly (green arrow).
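This direct path is what the cuFile API exercises when a file is read into or written from GPU memory. As a minimal illustration only (this guide does not prescribe a particular library), the sketch below uses the RAPIDS KvikIO Python bindings together with CuPy; the /mnt/nvme path is a placeholder for a GDS-capable mount.

  import cupy
  import kvikio

  # Allocate a buffer directly in GPU memory.
  a = cupy.arange(100_000, dtype=cupy.float32)

  # Write the GPU buffer to storage; with GDS active the transfer skips
  # the CPU bounce buffers in system memory shown in the first figure.
  f = kvikio.CuFile("/mnt/nvme/gds-test.bin", "w")  # placeholder path
  f.write(a)
  f.close()

  # Read it back into a fresh GPU buffer and verify.
  b = cupy.empty_like(a)
  f = kvikio.CuFile("/mnt/nvme/gds-test.bin", "r")
  f.read(b)
  f.close()
  assert cupy.allclose(a, b)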

PCI Switch Systems

This category covers servers designed with one or more integrated PCI Express switches, which increase the PCIe bandwidth available between devices while consuming fewer lanes from the CPUs. The GPUs can use these switches as a shortcut to communicate with a network device, reducing bus utilization and CPU load while also minimizing latency.

Example systems that can be configured with a PCI switch from OEMs and are NVIDIA AI Enterprise Compatible are:

  • Lenovo SR670 V2

  • HPE Apollo 6500

  • Inspur NF5468M6

  • Supermicro SYS-420GP-TNR

Note

These systems are only compatible in bare metal deployments.

Before GDS can be configured, ensure that the server is configured with the correct hardware topology. Not all slots in these servers are switched, so the exact slots in which the adapters are installed matter. See Validate PCI Switch System Topology to ensure your system is ready for GDS.
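As a hedged starting point for that validation, the sketch below walks /sys/bus/pci/devices and prints the upstream PCI bridge chain shared by each NVIDIA GPU and NVIDIA/Mellanox NIC pair; a pair that shares more than the root port typically sits behind a common PCIe switch. The vendor IDs and sysfs layout are standard Linux, but treat this as illustrative rather than the official procedure described in Validate PCI Switch System Topology.

  from pathlib import Path

  SYS_PCI = Path("/sys/bus/pci/devices")

  def read_attr(dev, attr):
      try:
          return (dev / attr).read_text().strip()
      except OSError:
          return ""

  def upstream_chain(dev):
      # Resolving the sysfs symlink exposes the full path through the
      # root port and any intermediate switch ports above the device.
      parts = dev.resolve().parts
      return [p for p in parts if p.count(":") == 2][:-1]

  gpus, nics = {}, {}
  for dev in SYS_PCI.iterdir():
      vendor, pci_class = read_attr(dev, "vendor"), read_attr(dev, "class")
      if vendor == "0x10de" and pci_class.startswith("0x03"):    # NVIDIA GPU
          gpus[dev.name] = upstream_chain(dev)
      elif vendor == "0x15b3" and pci_class.startswith("0x02"):  # NVIDIA/Mellanox NIC
          nics[dev.name] = upstream_chain(dev)

  for gpu, gchain in gpus.items():
      for nic, nchain in nics.items():
          shared = sorted(set(gchain) & set(nchain))
          print(f"GPU {gpu} / NIC {nic}: shared upstream bridges: {shared or 'none'}")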

Converged Card Systems

Some NVIDIA AI Enterprise compatible systems can be configured for GPUDirect Storage by leveraging NVIDIA’s converged cards, such as the A30X or A100X. A converged card combines an ARM processor, a GPU, and a network interface, connected by an internal PCI Express switch; it is this internal switch that allows GDS to operate. For GDS with NVIDIA AI Enterprise support, the system must be listed on the NVIDIA AI Enterprise Compatible list.

Any system certified for the A30X or A100X can be configured for GPUDirect Storage. Without a converged card, GDS requires GPUs and NICs to be behind the same PCI Express switch in the topology, which can be verified using lstopo.

GDS also requires the following system settings:

  • IOMMU – Disabled

  • ACS – Disabled

The following operating systems are supported for bare metal deployments:

  • Ubuntu 20.04

  • Ubuntu 22.04
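A quick way to spot-check both settings on a running system is sketched below: it infers the IOMMU state from /sys/kernel/iommu_groups and scans lspci -vvv output for enabled ACS control bits (run as root for complete lspci output). Consult the server’s BIOS documentation for the authoritative way to change either setting; this script only reports what the booted kernel sees.

  import subprocess
  from pathlib import Path

  # IOMMU: if the kernel created any IOMMU groups, the IOMMU is active.
  groups = list(Path("/sys/kernel/iommu_groups").glob("*"))
  print(f"IOMMU: {'enabled' if groups else 'disabled'} ({len(groups)} groups)")

  # ACS: lspci -vvv prints an 'ACSCtl:' line per capable device; a '+'
  # after a flag (e.g. 'SrcValid+') means that ACS feature is enabled.
  out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
  enabled = [line.strip() for line in out.splitlines()
             if "ACSCtl:" in line and "+" in line]
  if enabled:
      print("ACS appears enabled on some devices:")
      for line in enabled:
          print(" ", line)
  else:
      print("No enabled ACS control bits found.")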

GDS with NVIDIA AI Enterprise supports local NVMe drives and remote NFS file systems for bare metal deployments, as listed below (a quick mount-type check is sketched after the list):

  • Remote

    • NFS

  • Local

    • NVMe

      • EXT4

      • XFS
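As an illustration, the sketch below reads /proc/mounts and reports the filesystem type backing a path you plan to use with GDS, so you can confirm it is ext4, XFS, or NFS; the /mnt/nvme path is a placeholder.

  import os

  SUPPORTED = {"ext4", "xfs", "nfs", "nfs4"}

  def fs_type(path):
      # Return the longest mount point containing `path` and its filesystem type.
      path = os.path.realpath(path)
      best_mnt, best_len, best_type = "/", 0, None
      with open("/proc/mounts") as mounts:
          for line in mounts:
              _dev, mnt, fstype, *_ = line.split()
              if path == mnt or path.startswith(mnt.rstrip("/") + "/") or mnt == "/":
                  if len(mnt) > best_len:
                      best_mnt, best_len, best_type = mnt, len(mnt), fstype
      return best_mnt, best_type

  mount, fstype = fs_type("/mnt/nvme")  # placeholder data path
  print(f"{mount} is {fstype}; supported for GDS: {fstype in SUPPORTED}")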

NVIDIA AI Enterprise supports a large number of GPUs for compute, but only a subset of them support GDS.

For PCIe switched layouts, GDS supports Data Center and RTX professional desktop products with compute capability higher than 6. See the list of supported Data Center GPUs below; a full list can be found at https://developer.nvidia.com/cuda-gpus#compute. The subset of those GPUs that are also supported with NVIDIA AI Enterprise is listed below; a sketch for querying installed GPUs and their compute capability follows the converged GPU note:

  • NVIDIA H100

  • NVIDIA A100

  • NVIDIA A40

  • NVIDIA A30

  • NVIDIA A10

  • NVIDIA A16

  • NVIDIA A2

  • NVIDIA T4

  • NVIDIA V100

In deployments without PCI Express switching, the following converged GPUs are supported:

  • NVIDIA A100X

  • NVIDIA A30X

Note

Converged GPUs are also supported when installed in a slot provided through a PCI switch.
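To confirm which GPUs are installed and their compute capability, nvidia-smi can be queried as sketched below. The compute_cap query field requires a reasonably recent driver (assumed here); on older drivers, check the model against the list at https://developer.nvidia.com/cuda-gpus#compute instead.

  import subprocess

  # List each GPU's name and compute capability via nvidia-smi.
  out = subprocess.run(
      ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
      capture_output=True, text=True, check=True,
  ).stdout

  for line in out.strip().splitlines():
      name, cap = [field.strip() for field in line.split(",")]
      meets = float(cap) > 6.0  # compute-capability requirement stated above
      print(f"{name}: compute capability {cap} "
            f"({'meets' if meets else 'does not meet'} the GDS requirement)")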

As GDS provides a direct network connection from the GPU to storage, sufficient networking bandwidth is critically important. Ensure the server’s NICs provide enough bandwidth for your storage workload; an undersized link will become the performance bottleneck.
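As a simple sanity check, the negotiated link speed the kernel reports for each physical interface can be read from sysfs, as sketched below; compare it against the throughput of your storage target.

  from pathlib import Path

  # Report the negotiated link speed (in Mb/s) for each network interface.
  for iface in sorted(Path("/sys/class/net").iterdir()):
      try:
          speed = int((iface / "speed").read_text().strip())
      except (OSError, ValueError):
          continue  # virtual or down interfaces may not report a speed
      if speed <= 0:
          continue  # some drivers report -1 when the link is down
      print(f"{iface.name}: {speed} Mb/s ({speed / 1000:.0f} Gb/s)")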
