NVIDIA AI Enterprise Solution Guide
This guide explains how to set up a high-performance multi-node cluster of virtual machines. It introduces GPUDirect RDMA and Address Translation Services (ATS), and uses Docker as the platform for running high-performance multi-node deep learning training. ATS is a VMware PCIe support enhancement introduced in vSphere 7 Update 2. GPUDirect RDMA benefits from ATS and is certified and supported by NVIDIA AI Enterprise.
- Overview
- Compute Workflows
- Requirements
- Getting Started
- Configure NVIDIA ConnectX-6 Dx NIC and Spectrum switch for RoCE
- Enable ATS on VMware ESXi and VMs
- Enable ATS on the NVIDIA ConnectX-6 Dx NIC
- Configure NUMA Affinity for the VMs
- Setup Keyless Entry Between VMs On The Multi-Node Cluster
- Run Sample ResNet-50 Multi-Node Training
- Summary