Multi-Node Training for AI on Kubernetes (VMware Tanzu)

Overview

Welcome to the trial of NVIDIA AI Enterprise on NVIDIA LaunchPad.

The NVIDIA AI Enterprise suite includes the applications, frameworks, and tools that AI researchers, data scientists, and developers use for creating their AI and Machine Learning applications.

[Figure: NVIDIA AI Enterprise overview (nvaie-overview.png)]

NVIDIA AI Enterprise allows AI Practitioners to run Deep Learning workflows in virtual machines with the same performance as a local workstation. IT Administrators have all the tools needed to create VMs with the required NVIDIA AI Enterprise components for AI training and for deploying inference with Triton, so AI Practitioners can quickly access Jupyter Notebooks that leverage NVIDIA GPUs. This gives AI Practitioners instant access to valuable GPU resources within Enterprise data centers.

In this LaunchPad lab, you will run through an ML engineer workflow on Tanzu, an enterprise Kubernetes distribution offered by VMware. Tanzu makes it easy to create GPU-accelerated Kubernetes clusters on VMs on the fly for different teams in your organization (think of it as a service that creates Kubernetes clusters). You will learn how to train a deep learning model on a GPU instance and then use Horovod to scale the training across multiple GPU nodes within a Tanzu Kubernetes cluster. These GPU-accelerated nodes use the Message Passing Interface (MPI) Operator and the NVIDIA Collective Communications Library (NCCL), which is part of Magnum IO and is available in all NVIDIA AI Enterprise containers.
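
To give a feel for what scaling training across multiple nodes involves, the sketch below simulates data-parallel training in plain Python. It is a conceptual illustration only and does not call Horovod, MPI, or NCCL: the toy dataset, the one-parameter model, and the helper names (`local_gradient`, `allreduce_mean`, `train`) are all assumptions made for this example. Each simulated worker computes a gradient on its own shard of the data, the gradients are averaged (the role an allreduce plays in distributed training), and every worker applies the same update.

```python
# Conceptual simulation only: plain Python, no Horovod, MPI, or NCCL calls.
# All names and data below are hypothetical and chosen for illustration.
from typing import List, Tuple

Sample = Tuple[float, float]


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Mean-squared-error gradient for the model y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an allreduce: every worker ends up with the same average."""
    return sum(values) / len(values)


def train(data: List[Sample], num_workers: int, steps: int = 200, lr: float = 0.01) -> float:
    """Data-parallel SGD: shard the data, average gradients, apply one shared update."""
    # Worker `rank` takes every num_workers-th sample, starting at index `rank`.
    shards = [data[rank::num_workers] for rank in range(num_workers)]
    w = 0.0  # every worker starts from the same weight
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * allreduce_mean(grads)
    return w


# Toy dataset following y = 3x, split evenly across the simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = train(data, num_workers=1)
w_multi = train(data, num_workers=4)
print(w_single, w_multi)
```

Because the shards here are the same size, averaging the per-worker gradients gives the same update as computing the gradient over the full dataset on one worker, so both runs converge to the same weight. In a real cluster the averaging step runs over the network between GPUs rather than inside a single Python process.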

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.