Distributed Training with PyTorch

Learn how to run distributed training with PyTorch in DGX Cloud Lepton.

Some frameworks, such as PyTorch, TensorFlow, and Horovod, support distributed training, and DGX Cloud Lepton supports running distributed batch jobs built on these frameworks.

Here is an example of running a distributed PyTorch job with two workers on DGX Cloud Lepton.

Prepare the Python script for distributed training

As an example, this script implements distributed training of a convolutional neural network (CNN) on the MNIST dataset, using PyTorch's DistributedDataParallel (DDP) to train across multiple GPUs in parallel.
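The structure of such a script can be sketched as follows. This is a minimal illustration, not the script from the repository: it reads the environment variables that torchrun sets for each worker, replaces the MNIST CNN and DataLoader with a tiny model and random data, and runs one training step under DDP.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and
    # MASTER_PORT for every worker; fall back to single-process defaults
    # so the sketch can also be run directly as a smoke test.
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    # Tiny stand-in for the MNIST CNN: 28x28 grayscale inputs, 10 classes.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # One training step on random data, standing in for a real DataLoader
    # with a DistributedSampler; DDP all-reduces gradients across workers.
    inputs = torch.randn(8, 1, 28, 28, device=device)
    labels = torch.randint(0, 10, (8,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), labels)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    main()
```

Run standalone, this executes as a single worker; under torchrun, each spawned process joins the same process group and gradients are averaged across all workers.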

The full script is available in the GitHub repository.

Create Job through Dashboard

Head over to the Batch Jobs page, and follow the steps below to create a job.

Set up the job

Select Job Type

Select Torchrun at the top of the create job page.


Resource

In the resource section, first select which node group you want to use.

Then select the resource type you want to use, for example gpu.8xh100-sxm, and set the number of workers. This guide uses two replicas, so set the number of workers to 2.

Container

In the container section, use the image nvcr.io/nvidia/pytorch:25.08-py3, and set the Torchrun Command to the command that launches your training script.
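For two workers with eight GPUs each, the launch command could look like the sketch below. The script name train_mnist_ddp.py is a placeholder, and MASTER_ADDR/MASTER_PORT are assumed to be provided in each worker's environment; adjust these to match your script and platform settings.

```shell
# Representative torchrun invocation for 2 nodes x 8 GPUs.
# train_mnist_ddp.py is a placeholder for your training script;
# MASTER_ADDR and MASTER_PORT are assumed to be set in the environment.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train_mnist_ddp.py
```

Each worker runs the same command; torchrun spawns one process per GPU and coordinates them through the rendezvous endpoint.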

Create and Monitor

Click the Create button to create and run the job. You can then monitor it through the job logs or details.

The job details page shows the status and logs of each worker. You can also open the Web Terminal to connect to a worker node and inspect it directly. Once the job finishes, it will be shown with a Completed state.

Copyright © 2025, NVIDIA Corporation.