NVIDIA + VMware AI-Ready Platform Components

Overview

This reference architecture provides an example deployment of NVIDIA AI Enterprise on the AI-Ready platform from NVIDIA and VMware, along with example workloads that showcase the platform’s capabilities. It covers hardware, network, and workload topologies.

This reference architecture also shows examples of different AI workloads enabled by this accelerated platform, from multi-node training to inferencing.

The VMware + NVIDIA AI-Ready platform combines enabling technologies from NVIDIA for rapid deployment, management, and scaling of AI workloads with technologies from VMware that run these workloads in a virtualized environment. The platform consists mainly of the following components:

  • VMware vSphere 7 Update 2, Enterprise Plus edition

  • NVIDIA AI Enterprise Software Suite

  • A four-node cluster of NVIDIA-Certified Systems 2U servers with NVIDIA Ampere GPUs and NVIDIA® networking

This platform can be configured as a multi-purpose cluster that runs mixed workloads: AI and compute-intensive GPU-accelerated workloads alongside more traditional VDI and graphics-intensive workloads. Sharing a common platform between VDI and AI workloads can increase GPU utilization, thereby lowering the Total Cost of Ownership (TCO) and improving overall efficiency.

NVIDIA-Certified Systems

NVIDIA-Certified Systems™ bring together NVIDIA GPUs and NVIDIA networking in servers from leading vendors. These systems conform to NVIDIA’s design best practices and have passed a set of certification tests that validate the best system configurations for performance, manageability, scalability, and security. With NVIDIA-Certified Systems, enterprises can confidently choose performance-optimized servers to power their accelerated computing workloads, both in smaller configurations and at scale.

These systems include:

  • NVIDIA Ampere architecture-based GPUs such as the NVIDIA A100 Tensor Core GPU. The Tensor Core technology included in the Ampere architecture has brought dramatic speedups to AI operations, bringing down training times from weeks to hours and providing massive acceleration to inference.

  • NVIDIA® Mellanox® ConnectX® SmartNICs and the NVIDIA BlueField® data processing unit (DPU), which provide a host of software-defined hardware engines for accelerating networking and security. These enable the best of both worlds: best-in-class AI training and inference performance, with all the necessary levels of enterprise data privacy, integrity, and reliability.

Important

  • Multi-GPU support and configuration for NVIDIA AI Enterprise 1.0:

    • Multi-GPU is supported only with NVLink.

    • Multi-GPU without NVLink is not supported.

    • Multi-GPU with NVSwitch is not supported.

  • Because vSphere limits each VM to a maximum of 4 GPUs, on systems with more than 4 GPUs there is no way to guarantee optimal GPU placement within the same NUMA node for high performance.

  • ACS (Access Control Services) and ATS (Address Translation Services) for AMD CPUs will only be available with vSphere 7.0 U3.
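
The multi-GPU notes above can be read as a small set of rules to check before sizing a VM. The following Python sketch encodes them as a pre-deployment check. It is an illustration only: the names `Interconnect`, `VmGpuRequest`, and `check_request` are hypothetical and are not part of any vSphere or NVIDIA API, and the check covers only the limits stated in this section.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Interconnect(Enum):
    """How the GPUs assigned to one VM are linked (hypothetical model)."""
    NONE = "none"          # no NVLink between the GPUs
    NVLINK = "nvlink"
    NVSWITCH = "nvswitch"


# vSphere per-VM GPU limit as quoted in the notes above.
MAX_GPUS_PER_VM = 4


@dataclass
class VmGpuRequest:
    gpu_count: int
    interconnect: Interconnect


def check_request(req: VmGpuRequest) -> List[str]:
    """Return a list of problems; an empty list means the request passes."""
    problems = []
    if req.gpu_count < 1:
        problems.append("at least one GPU is required")
    if req.gpu_count > MAX_GPUS_PER_VM:
        problems.append(f"vSphere allows at most {MAX_GPUS_PER_VM} GPUs per VM")
    # Multi-GPU is supported only with NVLink (not without it, not with NVSwitch).
    if req.gpu_count > 1 and req.interconnect is not Interconnect.NVLINK:
        problems.append("multi-GPU is supported only with NVLink")
    return problems


if __name__ == "__main__":
    for req in (
        VmGpuRequest(2, Interconnect.NVLINK),
        VmGpuRequest(2, Interconnect.NVSWITCH),
        VmGpuRequest(8, Interconnect.NONE),
    ):
        print(req, "->", check_request(req) or "ok")
```

A check like this does not address NUMA placement, which the notes above say cannot be guaranteed on systems with more than 4 GPUs.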

Complete list of NVIDIA-Certified Systems: https://docs.nvidia.com/ngc/ngc-deploy-on-premises/nvidia-certified-systems/index.html.

Getting Started

The following topics are covered in this NVIDIA AI Enterprise Reference Architecture. They can serve as a starting point to build on, depending on your specific enterprise data center requirements:

  • NVIDIA-Certified System Hardware

  • Network Topology Overview

  • VMware ESXi Deployment Topology

  • Workload Topology

    • Deep Learning Multi-Node Training

    • Deep Learning Inferencing