Distributed Training with MPI
Learn how to run distributed training with MPI in DGX Cloud Lepton.
MPI (Message Passing Interface) is a standard for parallel computing. It defines a message-passing model that lets processes communicate and coordinate with each other across a distributed environment.
DGX Cloud Lepton supports MPI for distributed training. Here is an example of running a distributed MPI job with two workers on DGX Cloud Lepton.
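To make the programming model concrete, here is a minimal sketch of point-to-point message passing using mpi4py. This is illustrative only and assumes mpi4py is installed in your image; the PyTorch example later in this guide does not require it.

```python
# Minimal MPI message-passing sketch (assumes mpi4py is available in the image).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the job
size = comm.Get_size()   # total number of MPI processes

if rank == 0:
    # Rank 0 sends a small message to every other rank.
    for dest in range(1, size):
        comm.send({"greeting": "hello", "from": rank}, dest=dest)
else:
    msg = comm.recv(source=0)
    print(f"rank {rank}/{size} received {msg}")
```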
Prepare the Python script for distributed training
As an example, this script implements distributed training of a convolutional neural network (CNN) on the MNIST dataset using PyTorch's DistributedDataParallel (DDP) to leverage multiple GPUs in parallel.
The complete file is available in the accompanying GitHub repository.
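For reference, a minimal sketch of such a script is shown below. It assumes the job is launched with mpirun (so OpenMPI's OMPI_COMM_WORLD_* environment variables are available) and that MASTER_ADDR and MASTER_PORT are provided to each process; the file name train_mnist_ddp.py is an assumption, and the actual script in the repository may differ.

```python
# Sketch of DDP training of a small CNN on MNIST, launched via mpirun.
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


class Net(nn.Module):
    """Small CNN for 28x28 MNIST images."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.fc1 = nn.Linear(64 * 12 * 12, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))


def main():
    # OpenMPI exposes rank/size through environment variables when launched
    # with mpirun. MASTER_ADDR/MASTER_PORT are assumed to be set for NCCL.
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # In practice you may want to download the dataset on rank 0 only.
    dataset = datasets.MNIST(
        "./data", train=True, download=True,
        transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))]))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(Net().cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for data, target in loader:
            data, target = data.cuda(local_rank), target.cuda(local_rank)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(data), target)
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} done, last loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```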
Create Job through Dashboard
Head over to the Batch Jobs page, and follow the steps below to create a job.
Set up the job
Select Job Type
Select MPI at the top of the create job page.

Resource
In the resource section, first select the node group you want to use.
Then choose the resource type, for example gpu.8xh100-sxm, and set the number of workers.
This guide uses two replicas, so set the number of workers to 2.
Container
In the container section, use the image (nvcr.io/nvidia/pytorch:25.08-py3) and enter the command that launches your training script as the MPI Command to run the job.
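The exact command depends on where your script lives in the container and how many processes you want. Assuming one process per GPU (typical for DDP), two workers with eight GPUs each means 16 processes in total. A typical mpirun invocation might look like the following; the script path and process counts are assumptions, so adjust them to your setup:

```bash
# Illustrative MPI Command: 16 processes total, 8 per node (assumed script path).
mpirun --allow-run-as-root -np 16 -npernode 8 \
    python /workspace/train_mnist_ddp.py
```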
Create and Monitor
Now click the Create button to create and run the job. You can then monitor its progress through the job logs or the job details page.
Within the job details page, you can see the status and logs of each worker. You can also use the Web Terminal to connect to a worker node and inspect it directly. Once the job finishes, it will show a Completed state.