AI-Ready Enterprise Platform

Important

NVIDIA AI Enterprise is currently available for Early Access only. Join the Early Access Program HERE. This documentation is subject to change.

Unleashing AI for Every Enterprise

Artificial intelligence (AI) is transforming every industry, whether it’s by improving customer relationships in financial services, streamlining manufacturer supply chains, or helping doctors deliver better outcomes for patients. While most organizations know they need to invest in AI to secure their future, they struggle with finding the strategy and platform that can enable success.

Unlike traditional enterprise applications, AI apps are a relatively recent development for many IT departments. They’re anchored in rapidly evolving, open-source, bleeding-edge code and lack proven approaches that meet the rigors of scaled production settings in enterprises. In fact, Gartner states that just 53 percent of projects make it from pilot to production, and that the complexity of integrating AI solutions with existing infrastructure is among the top three barriers to AI implementation.

AI-Ready Platform from NVIDIA and VMware

VMware and NVIDIA have partnered to unlock the power of AI for every business by delivering an end-to-end enterprise platform optimized for AI workloads. This integrated platform delivers best-in-class AI software, the NVIDIA AI Enterprise Suite, optimized and exclusively certified for the industry’s leading virtualization platform, VMware vSphere®. Running on NVIDIA-Certified Systems™, industry-leading accelerated servers, this platform accelerates the speed at which developers can build AI and high-performance data analytics, enables organizations to scale modern workloads on the same VMware vSphere infrastructure they’ve already invested in, and delivers enterprise-class manageability, security, and availability.

_images/components-01.png

NVIDIA AI Enterprise Suite

NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data science applications and frameworks optimized and exclusively certified by NVIDIA to run on VMware vSphere with NVIDIA-Certified Systems. It includes key enabling technologies and software from NVIDIA for rapid deployment, management, and scaling of AI workloads in the modern hybrid cloud. NVIDIA AI Enterprise is licensed and supported by NVIDIA.

_images/components-02.png

The software in the NVIDIA AI Enterprise suite is organized into the following layers:

  • Infrastructure optimization software:

    • NVIDIA virtual GPU (vGPU) software

    • NVIDIA CUDA Toolkit

    • NVIDIA Magnum IO™ software stack for accelerated data centers

  • AI and data science frameworks:

    • TensorFlow

    • PyTorch

    • NVIDIA Transfer Learning Toolkit

    • NVIDIA Triton Inference Server

    • NVIDIA TensorRT

    • RAPIDS

The AI and data science frameworks are delivered as container images. Containerized software can be run directly with a tool such as Docker.
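
As a minimal sketch of what running one of these container images with Docker could look like, the Python snippet below assembles a `docker run` command line. The tag `21.05-tf2-py3` is the TensorFlow 2 release listed in this document; the registry path `nvcr.io/nvidia` is an assumption (your entitlement may point at a different registry path), and the helper name `docker_run_command` is hypothetical. The `--gpus` flag requires a Docker installation with NVIDIA GPU support configured.

```python
import shlex


def docker_run_command(image, tag, registry="nvcr.io/nvidia", gpus="all"):
    """Return the argument list for starting a framework container interactively.

    The default registry is an assumption; replace it with the registry path
    that applies to your environment.
    """
    return [
        "docker", "run",
        "--gpus", gpus,   # expose GPUs to the container
        "-it",            # interactive terminal
        "--rm",           # remove the container when it exits
        f"{registry}/{image}:{tag}",
    ]


cmd = docker_run_command("tensorflow", "21.05-tf2-py3")
print(shlex.join(cmd))
# docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:21.05-tf2-py3
```

The list returned by the helper can be passed directly to `subprocess.run` on a host where Docker is installed, which avoids shell quoting issues.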

Understanding End-To-End AI Workflows

The NVIDIA AI Enterprise software suite provides you with everything you need to deploy and support AI infrastructure on VMware. The graphic below outlines a typical AI workflow and how tools, features and GPUs are deployed.

_images/ed-02.png

Starting at the top left, AI practitioners must prepare data before they train a neural network. RAPIDS is a useful tool for ML workloads as well as for formatting and labeling the data that will be used in training workflows. Once the data is ready, the AI practitioner moves on to training. NVIDIA AI Enterprise offers pre-built, tuned containers for training neural networks with tools such as TensorFlow and PyTorch. Additional information about containers is covered in the sections below. NVIDIA AI Enterprise also has a broad library of pretrained models that can be extended with your own datasets by using the NVIDIA Transfer Learning Toolkit.

The AI practitioner can then optimize the newly trained model for efficiency using NVIDIA’s TensorRT SDK and tools, which fuse layers and eliminate unneeded steps. Finally, once the model is ready for production at scale, the NVIDIA Triton Inference Server can service incoming inferencing requests. Available as a Docker container, the Triton Inference Server supports both GPU and CPU workloads. It allows front-end client applications to submit inferencing requests to an AI inference cluster and can serve models from an AI model repository.
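
To illustrate how a front-end client application might submit an inferencing request, the sketch below builds (but does not send) an HTTP request in the shape of Triton's HTTP/REST inference endpoint. It assumes the server is reachable at `http://localhost:8000`; the model name `my_model`, the input tensor name `INPUT0`, and the helper `build_infer_request` are hypothetical and must match whatever is actually deployed in your model repository.

```python
import json
from urllib.request import Request


def build_infer_request(base_url, model_name, input_name, values):
    """Build (without sending) a POST request carrying one FP32 input vector."""
    payload = {
        "inputs": [
            {
                "name": input_name,            # hypothetical tensor name
                "shape": [1, len(values)],     # a batch of one vector
                "datatype": "FP32",
                "data": [float(v) for v in values],
            }
        ]
    }
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_infer_request("http://localhost:8000", "my_model", "INPUT0", [1, 2, 3])
print(request.get_method(), request.full_url)
# POST http://localhost:8000/v2/models/my_model/infer
```

Passing the request to `urllib.request.urlopen` would send it to a running server. For production clients, a dedicated client library is usually a better fit than hand-built JSON.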

Please refer to the AI Enterprise Solution Guides to further understand how to implement and deploy these end-to-end, enterprise-grade AI pipelines.

Many AI practitioners are looking to scale compute resources to reduce the time it takes to train a neural network and to produce results in real time. Taking a multi-GPU approach brings scientists closer to achieving a breakthrough, because they can experiment more rapidly with different neural networks and algorithms. As scientists train more and more models, model size and data consumption can grow significantly.

Virtual GPU (vGPU) deployments for deep learning workflows can be architected in three different ways within a virtualized environment:

  • Single VM assigned a full or fractional (partitioned) vGPU

  • Single VM with multiple vGPU devices

  • Multiple nodes (VMs)

Models can be small enough to run on one or more GPUs within a server, but as datasets grow, training times grow. This is where multi-node distributed training lends itself well to many organizations. The goal is to build a model from large datasets that understands the patterns and relationships behind the data, rather than just the data itself. This requires an exchange of data between multiple nodes throughout the training process, and GPUDirect RDMA with ATS provides high-performance networking between nodes. Each organization should evaluate its needs and choose the architectural approach that fits its deep learning workflows. For further detail to help assess architectural choices, please refer to the Multi-Node Deep Learning Training Solution Guide.
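
As a rough illustration of why adding nodes shortens training, the sketch below works through the arithmetic for synchronous data-parallel training, where each GPU processes its own slice of every global batch. This is an assumption: the text above does not prescribe a particular distribution strategy, and the dataset size, batch size, and helper name `data_parallel_plan` are hypothetical. Real speedups also depend on the cost of exchanging data between nodes.

```python
import math


def data_parallel_plan(dataset_size, per_gpu_batch, gpus_per_vm, num_vms):
    """Return (workers, global_batch, steps_per_epoch) for a data-parallel job."""
    workers = gpus_per_vm * num_vms            # one worker per GPU
    global_batch = per_gpu_batch * workers     # samples consumed per step
    steps_per_epoch = math.ceil(dataset_size / global_batch)
    return workers, global_batch, steps_per_epoch


# Hypothetical job: 1,000,000 samples, 64 samples per GPU, one GPU per VM.
for vms in (1, 2, 4):
    print(vms, data_parallel_plan(1_000_000, 64, 1, vms))
# 1 (1, 64, 15625)
# 2 (2, 128, 7813)
# 4 (4, 256, 3907)
```

Doubling the number of VMs roughly halves the steps per epoch, which is the effect multi-node training aims for as datasets grow.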

Supported Hardware and Software

NVIDIA GPUs:

  • NVIDIA A100 PCIe 40GB

  • NVIDIA A100 PCIe 80GB

  • NVIDIA A30

  • NVIDIA A40 (SR-IOV Mode Only)

  • NVIDIA A10

  • NVIDIA T4

NVIDIA-Certified Systems for NVIDIA GPU Cloud that support the NVIDIA GPUs listed above and are also certified for use with the VMware vSphere ESXi hypervisor.

Multi-node scaling requires an Ethernet NIC that supports RoCE. For best performance, NVIDIA recommends using an NVIDIA® Mellanox® ConnectX®-6 Dx and an NVIDIA A100 GPU in each VM used for multi-node scaling. Please refer to the Sizing Guide and the Multi-Node Training Solution Guide for further information.

Hypervisor software:

  • VMware vSphere Hypervisor (ESXi) Enterprise Plus Edition 7.0 Update 2

  • VMware vCenter Server 7.0 Update 2

Guest operating systems:

  • Ubuntu 20.04 with Linux kernel 5.4.0

NVIDIA AI Enterprise Software Components

Software Component                  NVIDIA Release

NVIDIA vGPU Software                13.0 Beta
TensorFlow 2                        21.05-tf2-py3
TensorFlow 1                        21.05-tf1-py3
PyTorch                             21.05-py3
NVIDIA Transfer Learning Toolkit    v3.0-py3
NVIDIA Triton Inference Server      21.05-py3 and 21.05-py3-sdk
NVIDIA TensorRT                     21.05-py3
RAPIDS                              21.06-cuda11.2-base-ubuntu20.04