NanoGPT Training
This is a step-by-step guide on training a NanoGPT model with distributed training on Lepton.
Create Training Job
Navigate to the create job page, where you will see the job configuration form.
- Job name: Set it to `nanogpt-training`.
- Resource: We need H100 GPUs; select H100 x8 and set the worker count to 1. You can use more workers to speed up training.
- Image: Choose the custom image and fill in `nvcr.io/nvidia/pytorch:24.11-py3`.
- Run command: Copy the following script into the run command field.
# Download the environment setup script from Lepton's GitHub repository, make it executable, and source it to initialize the environment variables.
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
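# Assumption: the init script exports the MASTER_ADDR, WORLD_SIZE and NODE_RANK
# variables consumed by torchrun below, derived from Lepton's own job environment.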
export NCCL_DEBUG=INFO
# Print the rank-related environment variables to verify the distributed setup.
env | grep RANK
pip install torch numpy transformers datasets tiktoken wandb tqdm
# Clone the NanoGPT repository
cd /workspace
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
# prepare train data
python data/shakespeare_char/prepare.py
# Count the GPUs on this worker; if it is not a full 8-GPU node, unset
# NCCL_SOCKET_IFNAME so NCCL auto-detects the network interface.
ngpus=$(nvidia-smi -L | wc -l)
if [ "${ngpus}" -ne 8 ]; then
unset NCCL_SOCKET_IFNAME
fi
accum_steps=$((ngpus*WORLD_SIZE*4))
sed -i "s/gradient_accumulation_steps = 1/gradient_accumulation_steps = ${accum_steps}/g" config/train_shakespeare_char.py
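# Example: with 8 GPUs per worker and WORLD_SIZE=1, accum_steps = 8*1*4 = 32;
# nanoGPT's train.py divides this by the total number of ranks, so each GPU
# accumulates 4 micro-batches per optimizer step.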
torchrun \
--master_addr ${MASTER_ADDR} \
--nnodes ${WORLD_SIZE} \
--node_rank ${NODE_RANK} \
--nproc_per_node ${ngpus} \
train.py config/train_shakespeare_char.py
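If you want a quick smoke test before committing to the full schedule, nanoGPT's configurator accepts --key=value overrides after the config file, so you can shorten the run. A minimal single-worker sketch (the override values are illustrative, not tuned):
torchrun --standalone --nproc_per_node ${ngpus} \
train.py config/train_shakespeare_char.py \
--max_iters=200 --eval_interval=100 --compile=False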
After the configuration is done, click Create to submit the job. You can then follow the job status on the job detail page.

The duration of the training job depends on the training configuration you set and the number of GPUs you use. While the job is running, you can check the real-time logs and metrics.
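Once the job completes, you can sample from the trained checkpoint to inspect the generated text. A minimal sketch, assuming the /workspace/nanoGPT directory from the training run is still accessible (for example via a mounted storage volume); out-shakespeare-char is the out_dir set by config/train_shakespeare_char.py:
cd /workspace/nanoGPT
python sample.py --out_dir=out-shakespeare-char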

