NanoGPT Training

Learn how to train a NanoGPT model with distributed training on DGX Cloud Lepton.

This step-by-step guide walks through creating the training job, configuring its resources, and monitoring its progress.

Create Training Job

Navigate to the create job page where you can see the job configuration form.

  • Job name: Set it to nanogpt-training.
  • Resource: Select H100 x8 GPUs and set the worker count to 1. You can increase the worker count to speed up training.
  • Image: Choose the custom image option and enter nvcr.io/nvidia/pytorch:24.11-py3.
  • Run command: Enter the training command in the run command field.
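The exact run command is not reproduced in this page. A minimal sketch, assuming the public karpathy/nanoGPT repository and its bundled Shakespeare character-level config (swap in your own dataset and config for real workloads), might look like:

```shell
# Hypothetical run command -- repo, dataset, and config are assumptions,
# not the command from the original page.
pip install numpy transformers datasets tiktoken wandb tqdm
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# Prepare the small character-level Shakespeare dataset.
python data/shakespeare_char/prepare.py

# Launch one training process per GPU on this 8-GPU worker.
torchrun --standalone --nproc_per_node=8 train.py config/train_shakespeare_char.py
```

With more than one worker, you would replace `--standalone` with `--nnodes`, `--node_rank`, and a rendezvous endpoint appropriate to your cluster.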

After completing the configuration, click Create to submit the job. You can then view the job status on the job detail page.

The duration of the training job depends on the number of parameters you set and the number of GPUs you use. When the job is running, you can check the real-time logs and metrics.
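As a rough back-of-envelope, training compute scales as about 6 × parameters × tokens, so you can sketch the expected duration before submitting the job. The peak-FLOPS and utilization figures below are assumptions (approximate H100 BF16 dense peak, a modest assumed MFU), not measured values:

```python
def estimated_training_hours(n_params, n_tokens, n_gpus,
                             peak_flops_per_gpu=989e12,  # approx. H100 BF16 dense peak
                             utilization=0.35):          # assumed model FLOPs utilization
    """Rough estimate: training FLOPs ~= 6 * params * tokens."""
    total_flops = 6 * n_params * n_tokens
    sustained_flops_per_sec = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_sec / 3600

# e.g. a 124M-parameter model on 10B tokens with one 8-GPU worker
print(round(estimated_training_hours(124e6, 10e9, 8), 2))  # well under an hour
```

Doubling the worker count roughly halves this estimate, ignoring communication overhead.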

Copyright © 2025, NVIDIA Corporation.