Distributed Training with MPI

Learn how to run distributed training with MPI in DGX Cloud Lepton.

MPI (Message Passing Interface) is a standard for parallel computing: a message-passing API that lets processes in a distributed environment communicate and coordinate with one another.
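To make the message-passing model concrete, here is a minimal example using mpi4py. This is an illustrative assumption: mpi4py is not part of this guide, which uses PyTorch's distributed tooling instead.

```python
# hello_mpi.py -- run with: mpirun -np 2 python3 hello_mpi.py
# Assumes mpi4py is installed; every MPI rank runs this same script.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 sends a (picklable) Python object to rank 1.
    comm.send({"greeting": "hello from rank 0"}, dest=1, tag=0)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    print(f"rank 1 received: {msg}")
```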

DGX Cloud Lepton supports MPI for distributed training. This guide walks through an example of running a distributed MPI job with two workers.

Prepare the Python script for distributed training

The example script implements distributed training of a convolutional neural network (CNN) on the MNIST dataset, using PyTorch's DistributedDataParallel (DDP) to train across multiple GPUs in parallel.

The full file is available in the GitHub repository here.
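For orientation, here is a minimal sketch of what such a DDP training script typically looks like. The file name train_ddp.py, the hyperparameters, and the Open MPI environment-variable mapping are illustrative assumptions, not the repository's actual code.

```python
# train_ddp.py -- a minimal DDP sketch, not the repository's actual script.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


class Net(nn.Module):
    """Small CNN for 28x28 MNIST images."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.fc1 = nn.Linear(64 * 12 * 12, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))


def main():
    # When launched with mpirun, map Open MPI's environment variables onto
    # the ones torch.distributed expects. The MASTER_ADDR/MASTER_PORT
    # defaults are placeholders; a multi-node launch must point them at a
    # host reachable by every rank.
    os.environ.setdefault("RANK", os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    os.environ.setdefault("WORLD_SIZE", os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
    # In a real job, download once (e.g. on rank 0) to avoid concurrent writes.
    dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # DDP replicates the model per GPU and all-reduces gradients in backward().
    model = DDP(Net().cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle with a different seed each epoch
        for data, target in loader:
            data, target = data.cuda(local_rank), target.cuda(local_rank)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(data), target)
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} finished, last loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```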

Create Job through Dashboard

Head over to the Batch Jobs page, and follow the steps below to create a job.

Set up the job

Select Job Type

Select MPI at the top of the create job page.

Resource

In the Resource section, first select the node group you want to use.

Then select the resource type, for example gpu.8xh100-sxm, and set the number of workers. This guide uses two workers, so set the count to 2.

Container

In the Container section, use the image (nvcr.io/nvidia/pytorch:25.08-py3) and set the MPI Command to the command that launches your training script.
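The original guide's exact command is not reproduced here. As a minimal sketch, assuming the DDP script above is saved at /workspace/train_ddp.py (a hypothetical path) and the two gpu.8xh100-sxm workers expose 8 GPUs each, the MPI Command might look like:

```bash
# Illustrative only: adjust the script path, rank counts, and exported
# variables to your setup. 2 workers x 8 GPUs = 16 ranks, 8 per node.
# MASTER_ADDR/MASTER_PORT must point at a host reachable by every rank.
mpirun --allow-run-as-root -np 16 -npernode 8 \
  -x MASTER_ADDR -x MASTER_PORT \
  python3 /workspace/train_ddp.py
```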

Create and Monitor

Now you can click the Create button to create and run the job. You can then monitor it through the job logs and details.

The job details page shows the status and logs of each worker. You can also use the Web Terminal to connect to a worker node and check its status directly. Once the job finishes, it will show a Completed state.

Copyright © 2025, NVIDIA Corporation.