Step #1: Single-Node Training

Multi-Node Training for AI on Kubernetes (VMware Tanzu) (Latest Version)

In this lab, you will train a basic MobileNet image classification model with Keras on a single node. Then, in Step 2, you will scale the deep learning training to run on two nodes. A Tanzu cluster with the GPU Operator pre-installed is already available to you, so you will follow the same steps a machine learning engineer would: install the MPI Operator, then run the multi-node training. Under ideal scaling, running the training on two nodes roughly halves the training time.
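The "roughly halves" estimate follows from simple data-parallel arithmetic: with two workers, each optimizer step consumes twice as many samples, so an epoch needs about half as many steps. The sketch below illustrates this under stated assumptions: one GPU worker per node, and a hypothetical dataset size and batch size that are not taken from the lab. It ignores communication overhead, so real speedups are usually somewhat lower.

```python
import math


def steps_per_epoch(num_samples: int, per_worker_batch: int, num_workers: int) -> int:
    """Optimizer steps each worker runs per epoch under data parallelism."""
    global_batch = per_worker_batch * num_workers
    return math.ceil(num_samples / global_batch)


NUM_SAMPLES = 10_000  # hypothetical dataset size, for illustration only
BATCH = 32            # hypothetical per-worker batch size

single = steps_per_epoch(NUM_SAMPLES, BATCH, num_workers=1)
double = steps_per_epoch(NUM_SAMPLES, BATCH, num_workers=2)

# Ratio of steps is the ideal speedup when every step takes the same time.
print(single, double, round(single / double, 2))  # prints: 313 157 1.99
```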


To assist you in your LaunchPad lab, there are a couple of important links on the left pane of this page. In the next step, you will use the Jupyter notebook. An SSH link to the VM console is provided as well; you will use it in the second step to install the MPI Operator.

First, you will explore the machine learning workflow by doing the following:

  • Learn how to use Horovod and run it on a single-node Jupyter instance.

  • Examine the data.

  • Build an input pipeline.

  • Build the model.

  • Train the model.

  • Test the model.
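The workflow above can be sketched in code roughly as follows. This is a minimal illustration, not the lab notebook's actual code: the function names `scaled_learning_rate` and `train`, the Adam optimizer, the 224x224 input size, and the linear learning-rate scaling heuristic are all assumptions made for the example. The Horovod calls shown (`hvd.init`, `hvd.DistributedOptimizer`, `hvd.callbacks.BroadcastGlobalVariablesCallback`) come from Horovod's Keras API.

```python
def scaled_learning_rate(base_lr: float, num_workers: int) -> float:
    """Linear scaling heuristic (an assumption here): grow the rate with worker count."""
    return base_lr * num_workers


def train(dataset, num_classes: int, epochs: int = 1, base_lr: float = 0.001):
    """Train MobileNet with Horovod.

    `dataset` is assumed to be a batched tf.data.Dataset of (image, label)
    pairs with 224x224 RGB images and integer labels.
    """
    # Imported lazily so the helper above is usable without TensorFlow installed.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # with a single process, hvd.size() is 1 and hvd.rank() is 0

    # Pin this process to one GPU, chosen by its local rank on the node.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Input pipeline: each worker reads a disjoint shard of the batches.
    dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())

    # Build the model from scratch (no pretrained weights).
    model = tf.keras.applications.MobileNet(
        weights=None, classes=num_classes, input_shape=(224, 224, 3)
    )

    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = tf.keras.optimizers.Adam(scaled_learning_rate(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(optimizer)

    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    # Start every worker from the same weights as rank 0.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    # Train; only rank 0 prints progress to avoid duplicated logs.
    model.fit(
        dataset,
        epochs=epochs,
        callbacks=callbacks,
        verbose=1 if hvd.rank() == 0 else 0,
    )
    return model
```

On a single node, `train(...)` behaves like ordinary Keras training because there is only one worker. The same function is meant to carry over unchanged when more workers are launched, which is the point of introducing Horovod before scaling out.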

To get started, follow the steps below.

  1. Using the Jupyter notebook link on the left-hand navigation pane, open the Jupyter notebook.

  2. Run through the image classification training Jupyter notebook to train a MobileNet model on the Stanford Online Products dataset.


To run a cell in the Jupyter notebook, click the cell you want to run and press Shift + Enter. To run a Linux bash command from a notebook cell, prefix the command with a bang symbol (!).
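The bang prefix is notebook (IPython) syntax rather than plain Python, so a line such as `!echo hello` only works inside a notebook cell. As a rough illustration of what it does, the sketch below runs the same command from ordinary Python with the standard `subprocess` module. The helper name `run_shell` is hypothetical, and this is an approximation of the notebook behavior, not how Jupyter implements it.

```python
import subprocess


def run_shell(command: str) -> str:
    """Run a shell command and return its standard output as text."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


# In a notebook cell, the bang form would be:  !echo hello
print(run_shell("echo hello"), end="")
```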

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.