AI-Ready Enterprise Platform

Unleashing AI for Every Enterprise

Artificial intelligence (AI) is transforming every industry, whether it’s by improving customer relationships in financial services, streamlining manufacturer supply chains, or helping doctors deliver better outcomes for patients. While most organizations know they need to invest in AI to secure their future, they struggle to find a strategy and platform that can enable success.

Unlike traditional enterprise applications, AI apps are a relatively recent development for many IT departments. They’re anchored in rapidly evolving, bleeding-edge, open-source code and lack proven approaches that meet the rigors of scaled production settings in enterprises. In fact, Gartner states that just 53 percent of projects make it from pilot to production, with the complexity of integrating AI solutions with existing infrastructure among the top three barriers to AI implementation.

AI-Ready Platform from NVIDIA and VMware

VMware and NVIDIA have partnered to unlock the power of AI for every business by delivering an end-to-end enterprise platform optimized for AI workloads. This integrated platform delivers best-in-class AI software, the NVIDIA AI Enterprise suite, optimized and exclusively certified for the industry’s leading virtualization platform, VMware vSphere®. Running on NVIDIA-Certified Systems™, industry-leading accelerated servers, this platform accelerates the speed at which developers can build AI and high-performance data analytics applications, enables organizations to scale modern workloads on the same VMware vSphere infrastructure they’ve already invested in, and delivers enterprise-class manageability, security, and availability.

../_images/components-01.png

NVIDIA AI Enterprise offers the flexibility to run AI workloads within VMs, and for organizations that want to embrace containers, upstream Kubernetes is also offered. VMware Tanzu support is upcoming and will be available soon. By leveraging Kubernetes, IT administrators can automate the deployment, scaling, and management of containerized AI applications and frameworks.
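As a minimal sketch of the Kubernetes path, a containerized framework can be scheduled onto a GPU node with a Pod spec that requests a GPU through the resource name exposed by the NVIDIA device plugin (deployed by the GPU Operator); the Pod name and image path below are illustrative, not prescribed by this guide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-training          # illustrative name
spec:
  containers:
  - name: tensorflow
    # Image path is an assumption for illustration; use the framework
    # image name and tag from your NGC registry entitlement.
    image: nvcr.io/nvaie/tensorflow-2:21.07-tf2-py3
    resources:
      limits:
        nvidia.com/gpu: 1    # GPU scheduling handled by the NVIDIA device plugin
```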

NVIDIA AI Enterprise Suite

NVIDIA AI Enterprise is an end-to-end, cloud native suite of AI and data science applications and frameworks optimized and exclusively certified by NVIDIA to run on VMware vSphere with NVIDIA-Certified Systems. It includes key enabling technologies and software from NVIDIA for rapid deployment, management, and scaling of AI workloads in the modern hybrid cloud. NVIDIA AI Enterprise is licensed and supported by NVIDIA.

../_images/components-02.png

The software in the NVIDIA AI Enterprise suite is organized into the following layers:

  • Infrastructure optimization software:

    • NVIDIA AI Enterprise Host Software

    • NVIDIA CUDA Toolkit

    • NVIDIA Magnum IO™ software stack for accelerated data centers

  • AI and data science frameworks:

    • TensorFlow

    • PyTorch

    • NVIDIA Triton Inference Server

    • NVIDIA TensorRT

    • RAPIDS

The AI and data science frameworks are delivered as container images. Containerized software can be run directly with a tool such as Docker.
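For example, a framework container can be started directly with Docker. This is a sketch only: the registry path is an assumption for illustration, the dataset path is a placeholder, and the `--gpus` flag assumes the NVIDIA Container Toolkit is installed on the host.

```shell
# Pull and run the TensorFlow 2 training container interactively,
# exposing all GPUs and mounting a local dataset directory.
# The nvcr.io path below is illustrative; use the image from your NGC org.
docker run --gpus all -it --rm \
    -v /path/to/dataset:/workspace/data \
    nvcr.io/nvaie/tensorflow-2:21.07-tf2-py3
```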

Why Containers?

One of the key benefits of using containers is that you install your application, dependencies, and environment variables once into the container image, rather than on each system you run on. Additional benefits of using containers include:

  • There is no risk of conflict with libraries that are installed by others.

  • Containers allow the use of multiple different deep learning frameworks, which may have conflicting software dependencies, on the same server.

  • After you build your application into a container, you can run it in many other places, especially servers, without installing any software.

  • Legacy accelerated compute applications can be containerized and deployed on newer systems, on premises, or in the cloud.

  • Specific GPU resources can be allocated to a container for isolation and better performance.

  • You can easily share, collaborate, and test applications across different environments.

  • Multiple instances of a given deep learning framework can be run concurrently, each having one or more specific GPUs assigned.

  • Containers can resolve network-port conflicts between applications by mapping container-ports to specific externally visible ports when launching the container.
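The last two benefits above — GPU isolation and port mapping — can be sketched with Docker's `--gpus` and `-p` flags. The model repository path and container names below are illustrative:

```shell
# Run two instances of an inference container, each pinned to its own GPU
# and each mapping the container's HTTP port 8000 to a different host port.
docker run -d --gpus '"device=0"' -p 8000:8000 --name triton-gpu0 \
    nvcr.io/nvidia/tritonserver:21.07-py3 tritonserver --model-repository=/models
docker run -d --gpus '"device=1"' -p 8001:8000 --name triton-gpu1 \
    nvcr.io/nvidia/tritonserver:21.07-py3 tritonserver --model-repository=/models
```

Each instance sees only its assigned GPU, and the two servers coexist on one host without a port conflict.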

Understanding End-To-End AI Workflows

The NVIDIA AI Enterprise software suite provides you with everything you need to deploy and support AI infrastructure on VMware. The graphic below outlines a typical AI workflow and how tools, features, and GPUs are deployed.

../_images/ed-02.png

Starting at the top left, AI practitioners must prepare data before they train the neural network. RAPIDS is a great tool for this stage, accelerating machine learning workloads as well as the formatting and labeling of data that will be used in training workflows. Once the data is ready, the AI practitioner moves on to training. NVIDIA AI Enterprise offers pre-built, tuned containers for training neural networks with tools such as TensorFlow and PyTorch. Additional information regarding containers is covered in the sections below. The AI practitioner can further optimize the newly trained model using NVIDIA’s TensorRT SDK and tools, which fuse layers and eliminate unneeded steps to make the model most efficient. Finally, once the model is ready for production at scale, the NVIDIA Triton Inference Server can service incoming inference requests. Available as a Docker container, the Triton Inference Server supports both GPU and CPU workloads; it allows front-end client applications to submit inference requests to an AI inference cluster and can serve models from an AI model repository.
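As a rough sketch of the serving step, a client can check a running Triton Inference Server and submit an inference request over its HTTP/REST endpoint. The host, port, model name, and tensor shape below are illustrative:

```shell
# Verify the server is ready to accept requests (Triton's v2 REST API).
curl -s localhost:8000/v2/health/ready

# Submit a minimal inference request to a hypothetical model named "mymodel".
curl -s -X POST localhost:8000/v2/models/mymodel/infer \
    -H 'Content-Type: application/json' \
    -d '{"inputs": [{"name": "input0", "shape": [1, 4],
                     "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]}'
```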

Please refer to the AI Enterprise Solution Guides to further understand how to implement and deploy these end-to-end, enterprise-grade AI pipelines.

Many AI practitioners are looking to scale compute resources to reduce the time it takes to complete the training of a neural network and produce results in real time. Taking a multi-GPU approach brings scientists closer to achieving a breakthrough, as they can more rapidly experiment with different neural networks and algorithms. As scientists train more and more models, model sizes and data consumption can grow significantly.

Virtual GPUs (vGPUs) for deep learning workflows can be deployed using three different architectural approaches within a virtualized environment:

  • Single VM assigned a full or fractionalized-partitioned vGPU

  • Single VM with multiple vGPU devices with NVLink

  • Multiple nodes (VMs)

Models can be small enough to run on one or more GPUs within a server, but as datasets grow, training times grow with them. This is where multi-node distributed training is a good fit for many organizations. The goal is to build a model from large datasets that captures the patterns and relationships behind the data, rather than just the data itself. This requires an exchange of data between nodes throughout the training process, and GPUDirect RDMA with ATS provides high-performance networking between them. Each organization should evaluate its needs and choose the correct architectural approach for executing deep learning workflows. For further detail to help assess these architectural choices, please refer to the Multi-Node Deep Learning Training Solution Guide.
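As an illustration of the multiple-nodes approach, a distributed PyTorch job of this software generation is typically launched on each VM with the torch.distributed launcher, one process per GPU. The addresses, GPU count, node count, and script name below are placeholders, not values prescribed by this guide:

```shell
# Run on node 0 of a two-VM cluster (repeat on node 1 with --node_rank=1).
# master_addr points at node 0; train.py is a placeholder training script.
python -m torch.distributed.launch \
    --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.10 --master_port=29500 \
    train.py
```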

Supported Hardware and Software

NVIDIA GPUs:

  • NVIDIA A100 40GB

  • NVIDIA A100 HGX 40GB

  • NVIDIA A100 80GB

  • NVIDIA A100 HGX 80GB

  • NVIDIA A40¹

  • NVIDIA A30

  • NVIDIA A10

  • NVIDIA A16

  • NVIDIA RTX A6000²

  • NVIDIA RTX A5000²

  • NVIDIA T4

Note

¹The default mode on the NVIDIA A40 is Display Off Mode, which supports SR-IOV; SR-IOV is required to run NVIDIA AI Enterprise.

²The default mode on the NVIDIA RTX A6000 and RTX A5000 is Display On Mode, which must be toggled off. To change the mode of a GPU that supports multiple display modes, use the displaymodeselector tool, which you can request from the NVIDIA Display Mode Selector Tool page on the NVIDIA Developer website.

NVIDIA-Certified Systems for NVIDIA GPU Cloud that support the NVIDIA GPUs listed above and are also certified for use with the VMware vSphere ESXi hypervisor.

Multi-node scaling requires an Ethernet NIC that supports RoCE. For best performance, NVIDIA recommends using an NVIDIA® Mellanox® ConnectX®-6 Dx NIC and an NVIDIA A100 GPU in each VM used for multi-node scaling. Please refer to the Sizing Guide and the Multi-Node Training Solution Guide for further information.

Hypervisor software:

  • VMware vSphere Hypervisor (ESXi) Enterprise Plus Edition 7.0 Update 2

  • VMware vCenter Server 7.0 Update 2

Guest operating systems:

  • Ubuntu 20.04 with Linux kernel 5.4.0

NVIDIA AI Enterprise Software Components

Software Component                    NVIDIA Release Version
NVIDIA TensorRT                       21.07-py3
NVIDIA Triton Inference Server        21.07-py3
NVIDIA RAPIDS                         21.08-cuda11.4-ubuntu20.04-py3.8
NVIDIA GPU Operator                   v1.8.0
NVIDIA Network Operator               v1.0.0
NVIDIA AI Enterprise Host Software    470.63-esxi7
PyTorch                               21.07-py3
TensorFlow 1                          21.07-tf1-py3
TensorFlow 2                          21.07-tf2-py3