AI-Ready Enterprise Platform

AI-Ready Platform From NVIDIA

NVIDIA AI Enterprise is a software suite that enables organizations to harness the power of AI, even if they don’t have AI expertise today. Optimized to streamline AI development and deployment, NVIDIA AI Enterprise includes proven, open-source containers and frameworks that are certified to run on common data center platforms from VMware and Red Hat, on NVIDIA-Certified servers configured with GPUs or CPU only, and in the public cloud. Because support is included, organizations get the transparency of open source and the assurance that the global NVIDIA Enterprise Support team will help keep AI projects on track. With NVIDIA AI Enterprise software, AI is accessible to organizations of any size, providing the compute power, tools, and support organizations need to focus on creating business value from AI, not on the AI infrastructure.

[Figure: NVIDIA AI Enterprise platform overview (components-01.png)]

Optimized so every organization can be good at AI: Every step of the AI workflow is streamlined, from data prep to training to inference and deployment, and AI practitioners can train complex neural network models as well as tree-based models. Optimized for AI development and deployment, NVIDIA AI Enterprise includes proven, open-source containers and frameworks that ease the adoption of enterprise AI, such as conversational AI, often used for automated customer support and digital sales agents, and computer vision, used for segmentation, classification, and detection.

Certified to deploy anywhere: Certified to run on mainstream NVIDIA-Certified servers, whether virtualized, bare metal, CPU-only, GPU-accelerated, or in the public cloud, NVIDIA AI Enterprise can be deployed nearly anywhere and keeps AI projects portable across today’s increasingly hybrid data center.

Supported by NVIDIA: With NVIDIA Enterprise Support included, both AI practitioners and IT administrative teams have global access to NVIDIA experts for coordinated support across the full solution, including partner products. They also gain control of upgrade and maintenance schedules with long-term support (LTS) options, plus access to instructor-led customer training and knowledge base resources.

Understanding End-To-End AI Workflows

The NVIDIA AI Enterprise software suite provides everything you need to deploy and support AI infrastructure. The graphic below outlines a typical AI workflow and shows how tools, features, and GPUs are deployed at each stage.

[Figure: typical end-to-end AI workflow (ed-02.png)]

Starting at the top left, AI practitioners must prepare data before they can train a neural network. RAPIDS is a great tool for this stage, accelerating ML workloads as well as the formatting and labelling of data that will be used in training workflows. Once the data is ready, the AI practitioner moves on to training. NVIDIA AI Enterprise offers pre-built, tuned containers for training neural networks with tools such as TensorFlow and PyTorch. The NVIDIA TAO Toolkit gives you a faster, easier way to accelerate training and quickly create highly accurate, performant, domain-specific vision and conversational AI models. Additional information regarding containers is covered in the sections below. The AI practitioner can further optimize the newly trained model with NVIDIA’s TensorRT SDK and tools, which fuse layers and eliminate unneeded steps (see the sketch after this paragraph). Finally, once the model is ready for production at scale, the NVIDIA Triton Inference Server can service incoming inference requests: it allows front-end client applications to submit inference requests to an AI inference cluster and can serve models from an AI model repository.
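
As an illustration of the optimization step, here is a minimal sketch that builds a TensorRT engine from an ONNX model with the TensorRT Python API. The paths model.onnx and model.plan are hypothetical placeholders, not files from this guide, and enabling FP16 is an optional choice shown here for lower-precision calibration.

    # Minimal TensorRT build sketch: parse an ONNX model and serialize an
    # optimized engine. "model.onnx" and "model.plan" are hypothetical paths.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):           # surface parse errors, if any
            raise RuntimeError(str(parser.get_error(0)))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)        # allow reduced-precision kernels
    engine = builder.build_serialized_network(network, config)

    with open("model.plan", "wb") as f:          # deployable engine artifact
        f.write(engine)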

Please refer to the AI Enterprise Solution Guides to further understand how to implement and deploy these end-to-end, enterprise-grade AI pipelines.

NVIDIA AI Enterprise

The NVIDIA AI Enterprise software suite includes AI frameworks and containers that provide performance-optimized data science, training, and inference tools, simplifying building, sharing, and deploying AI software so enterprises can gather insights faster and deliver business value sooner. Even organizations that lack AI expertise can adopt AI, because NVIDIA AI Enterprise includes easy-to-use tools for every stage of the AI workflow, from data prep to training, inferencing, and deployment at scale.

[Figure: NVIDIA AI Enterprise software components (components-02.png)]
  • NVIDIA TAO Toolkit - Gives you a faster, easier way to accelerate training and quickly create highly accurate, performant, domain-specific vision and conversational AI models. It abstracts away the AI/deep learning framework complexity, letting you fine-tune high-quality NVIDIA pre-trained models with only a fraction of the data required to train from scratch. Developers can go beyond customization and optimize these models for low-latency, high-throughput inference. This enables you to create custom, production-ready AI models in hours rather than months, without a huge investment in AI expertise.

  • NVIDIA RAPIDS - The first step in the end-to-end AI workflow is data preparation, which must happen before any neural network can be trained. NVIDIA RAPIDS is optimized for GPU acceleration and reduces data science processes from hours to seconds; combined with an NVIDIA A100, it delivers up to 70x faster performance and is up to 20x more cost-effective than comparable CPU-only configurations. (A minimal data-prep sketch follows this list.)

  • PyTorch and TensorFlow - Open-source deep learning frameworks for training and machine learning, such as PyTorch and TensorFlow, are integrated with NVIDIA RAPIDS to simplify enterprise AI development. Leveraging these tools and pre-trained models accelerates development and deployment cycles and eliminates the need to procure, manage, certify, and deploy different environments. (A minimal training-loop sketch follows this list.)

  • TensorRT - TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision while maintaining high accuracy, and deploy to hyperscale data center, embedded, or automotive product platforms. (See the TensorRT build sketch in the workflow section above.)

  • NVIDIA Triton Inference Server - Triton Inference Server simplifies and optimizes the deployment of AI models at scale in production. It integrates with Kubernetes for orchestration and auto-scaling, allows front-end client applications to submit inference requests to an AI inference cluster, and serves models from an AI model repository. Triton Inference Server supports all major frameworks, such as TensorFlow, TensorRT, PyTorch, MXNet, Python, and more. Triton Inference Server also includes the RAPIDS Forest Inference Library (FIL) backend for GPU and CPU inference of random forest, GBDT, and decision tree models. Triton with the FIL backend delivers the best inference performance for tree-based models on GPUs, enabling simplified deployment of large tree models with low latency and high accuracy. (A minimal client sketch follows this list.)
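
As a minimal illustration of the data-prep step, the following cuDF sketch assumes a hypothetical transactions.csv file with an amount column; cuDF mirrors the pandas API on the GPU.

    # Minimal cuDF data-prep sketch; "transactions.csv" and the "amount"
    # column are hypothetical stand-ins for your own data.
    import cudf

    df = cudf.read_csv("transactions.csv")   # load the dataset onto the GPU
    df = df.dropna(subset=["amount"])        # drop rows with missing amounts
    # Standardize the feature so it is ready for training.
    df["amount_scaled"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    train = df.sample(frac=0.8, random_state=42)   # 80/20 train/validation split
    valid = df.drop(train.index)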
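
Next, a minimal PyTorch training-loop sketch; the model and data here are random placeholders rather than a real dataset.

    # Minimal PyTorch training loop with placeholder data; trains a small
    # classifier on the GPU when one is available.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 64, device=device)           # placeholder batch
        y = torch.randint(0, 10, (32,), device=device)   # placeholder labels
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()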
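
Finally, a minimal client-side sketch using Triton’s Python HTTP client (tritonclient). The model name my_model and the tensor names and shapes are placeholders for whatever your model repository actually contains, and the server is assumed to be listening on its default HTTP port.

    # Minimal Triton HTTP client sketch; "my_model", "input__0", and
    # "output__0" are hypothetical names from a model repository.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    inputs = [httpclient.InferInput("input__0", [1, 64], "FP32")]
    inputs[0].set_data_from_numpy(np.random.rand(1, 64).astype(np.float32))
    outputs = [httpclient.InferRequestedOutput("output__0")]

    # Submit the inference request and read back the result tensor.
    result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
    print(result.as_numpy("output__0"))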

Supported Hardware and Software

NVIDIA GPUs:

  • NVIDIA DGX H100³

  • NVIDIA H100 PCIe⁴

  • NVIDIA DGX A100³

  • NVIDIA A100 40GB

  • NVIDIA A100 HGX 40GB

  • NVIDIA A100X 40GB

  • NVIDIA A100 80GB

  • NVIDIA A100 HGX 80GB

  • NVIDIA A100X 80GB

  • NVIDIA A40¹

  • NVIDIA A30

  • NVIDIA A30X

  • NVIDIA A10

  • NVIDIA A16

  • NVIDIA A2

  • NVIDIA RTX A6000²

  • NVIDIA RTX A5000²

  • NVIDIA T4

  • NVIDIA V100

Note

¹The default mode on the NVIDIA A40 is Display Off mode, which supports SR-IOV as required to run NVIDIA AI Enterprise.

²The default mode on the NVIDIA RTX A6000 and RTX A5000 is Display On mode, which must be toggled off. To change the mode of a GPU that supports multiple display modes, use the displaymodeselector tool, which you can request from the NVIDIA Display Mode Selector Tool page on the NVIDIA Developer website.

³The NVIDIA DGX H100 and NVIDIA DGX A100 are currently bundled with NVIDIA AI Enterprise. Certain deployment types, such as virtualization, are not supported on DGX systems. Further information on DGX systems can be found in the DGX Systems Documentation.

⁴The NVIDIA H100 PCIe is currently only supported for bare metal deployments.

NVIDIA-Certified Systems: systems specifically certified for NVIDIA AI Enterprise.

Multi-node scaling requires an Ethernet NIC that supports RoCE. For best performance, NVIDIA recommends using an NVIDIA® Mellanox® ConnectX®-6 Dx and an NVIDIA A100 GPU in each VM used for multi-node scaling. Please refer to the Sizing Guide and the Multi-Node Training solution guide for further information.

Note

All supported configurations are listed in the NVIDIA AI Enterprise Product Support Matrix.

Hypervisor software:

Note

NVIDIA AI Enterprise 2.3 only supports bare metal and pass-through deployments with the Data Center Driver.

  • VMware vSphere Hypervisor (ESXi) Enterprise Plus Edition 7.0 Update 2 or later

  • VMware vCenter Server 7.0 Update 2 or later

NVIDIA AI Enterprise 1.1

  • VMware vSphere Hypervisor (ESXi) Enterprise Plus Edition 6.7

  • VMware vCenter Server 6.7

  Note

    VMware vSphere 6.7 only supports T4 and V100 GPUs.

Guest and Bare Metal operating systems:

NVIDIA AI Enterprise 2.1

  • Red Hat Enterprise Linux 9.0

  • Ubuntu 22.04

NVIDIA AI Enterprise 2.0

  • Red Hat CoreOS 4.9 or later

Container Orchestration Platforms:

NVIDIA AI Enterprise 1.1

  • VMware vSphere 7.0 Update 3c with Tanzu

NVIDIA AI Enterprise 2.0

  • Red Hat OpenShift 4.9 or later

NVIDIA AI Enterprise Software Components

Software Components                          NVIDIA Release Version
NVIDIA vGPU Software                         510.85.03
NVIDIA AI Enterprise Driver Software         510.85.02
NVIDIA GPU Operator                          v1.11.1
NVIDIA Network Operator                      v1.2.0
TensorFlow 1                                 22.07
PyTorch                                      22.07
NVIDIA Triton Inference Server               22.07
NVIDIA TensorRT                              22.07
NVIDIA RAPIDS                                22.06
TAO Toolkit for Language Model (Conv AI)     3.22.05
TAO Toolkit for Conv AI                      3.22.05
TAO Toolkit for CV                           3.22.05

Note

Pull tags for each container, operator, and driver can be found in the Enterprise Catalog.