NeMo Framework Quick Start Guide

This guide demonstrates how to launch an LLM pre-training job on DGX Cloud Lepton using NeMo Framework with minimal effort. You can use this guide to validate your cluster's setup and its readiness for larger-scale distributed jobs. For a complete end-to-end tutorial on training an LLM from scratch, refer to the full guide.

Requirements

To validate NeMo Framework on DGX Cloud Lepton, you need:

  • An NVIDIA DGX Cloud Lepton cluster with at least 2x A100 or newer GPU nodes with 8 GPUs each.
  • Python 3.10 or newer with pip installed on a local machine.
  • A shared filesystem with read/write access that is mountable in jobs.

Initial Setup

A few Python dependencies are required to launch the script. Open a terminal on your local machine and navigate to a directory where you want to save the launch scripts. Install the dependencies with the following commands in the terminal:

python3 -m venv env
source env/bin/activate
pip3 install nemo_toolkit[nlp] git+https://github.com/NVIDIA/nemo-run megatron-core opencc==1.1.6
Note

The source env/bin/activate command above activates a Python virtual environment with the dependencies installed. To leave the virtual environment, run deactivate. To activate it again, navigate back to the directory where the virtual environment named env was saved and run source env/bin/activate again. If you run into ModuleNotFoundError exceptions, the environment likely needs to be re-activated.
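
A quick way to confirm that the virtual environment is active and the dependencies resolved correctly is a short import check using the same modules the training script relies on later in this guide. This is an optional, illustrative snippet, not part of the guide's scripts:

# check_env.py -- optional sanity check for the virtual environment
import nemo_run as run
from nemo.collections import llm

print("Imported", run.__name__, "and", llm.__name__, "successfully")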

Authenticate with DGX Cloud Lepton

The Lepton Python SDK needs to authenticate with your DGX Cloud Lepton workspace to schedule jobs. To authenticate with the CLI, open the DGX Cloud Lepton UI and navigate to the Settings > Tokens page. It shows a command for authenticating with your workspace that looks similar to the following:

lep login -c xxxxxx:************************

Copy the code shown in the UI and run it locally in your terminal with the virtual environment active.

Launch Pre-training Script

Now you can launch a pre-training job with NeMo Framework using synthetic data, which is generated on the fly during pre-training. While the resulting trained model won't be meaningful, this process validates the entire software stack and helps assess performance.

We'll start with a template script for single-GPU pre-training, then modify it to run on eight GPUs on a single node, and finally scale to 16 GPUs across two nodes.

Example script template

This is the template for launching pre-training. Save a copy of it locally as train.py:

from nemo.collections import llm
from creds import hf_creds  # Hugging Face access token stored in a local creds.py (see the example below)

import nemo_run as run


def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/nemo-workspace/nemotron3-4b", # Path to store checkpoints
        name="nemotron3-4b",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100,
    )

    recipe.trainer.val_check_interval = 0.0
    recipe.trainer.strategy.ckpt_load_strictness = False
    return recipe

def lepton_executor(nodes: int = 1, devices: int = 1) -> run.LeptonExecutor:
    if devices > 1:
        resource_shape = f"gpu.{devices}xa100-80gb"
    else:
        resource_shape = "gpu.a100-80gb"
    mounts = [
        {
            "path": "/",
            "mount_path": "/nemo-workspace",
            "from": "node-nfs:my-nfs",
        }
    ]

    return run.LeptonExecutor(
        resource_shape=resource_shape,
        container_image="nvcr.io/nvidia/nemo:25.04",
        nemo_run_dir="/nemo-workspace/nemo-run",
        mounts=mounts,
        node_group="xxxxx",
        nodes=nodes,
        nprocs_per_node=devices,
        env_vars={
            "HF_TOKEN": hf_creds,
            "TORCH_HOME": "/nemo-workspace/.cache"
        },
        launcher="torchrun",
        packager=run.PatternPackager(
            include_pattern="scripts/*",
            relative_path="",
        )
    )

def run_pretraining():
    recipe = configure_recipe(nodes=1, gpus_per_node=1)
    executor = lepton_executor(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor)

if __name__ == "__main__":
    run_pretraining()

The script template will need to be modified for your cluster. The settings to update are as follows:

  • resource_shape = f"gpu.{devices}xa100-80gb" and resource_shape = "gpu.a100-80gb": Replace the a100-80gb portion in both lines with the desired resource shape, which determines the GPU type and configuration used for the job. If your cluster has H100 GPUs, this would likely be h100-80gb. Be sure to update both lines, as they handle the single-GPU and multi-GPU resource shape formats respectively.
  • node_group="xxxxx": Replace xxxxx with the node group to run in. The list of available node groups can be found in the Nodes tab in the UI.
  • "from": "node-nfs:my-nfs": Enter the name of the storage to mount in all jobs. This can be found in the UI while creating a job and selecting a storage option.

Pre-train using a single GPU

After modifying the template script for your environment, activate the virtual environment if not already done with:

source env/bin/activate

Next, launch the pre-training job with:

python3 train.py

This will copy the training script to the cluster storage and spin up a Batch Job on a single GPU using the nemo:25.04 container. You can view the job on the Batch Jobs page in the UI. After the job starts, it takes a couple of minutes for all of the libraries to load, and then training progress is shown in the logs. You'll know training has started when you see a line similar to this:

Training epoch 0, iteration 1/149999 | lr: 4.5e-05 | consumed_samples: 512 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.67 | train_step_timing in s: XXX

The train_step_timing in s value reports the time taken per training step, which indicates the overall throughput.
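
If you want a rough throughput figure to compare runs against one another, you can derive it from the values in the log line. Below is a small sketch of the arithmetic, using placeholder numbers you would replace with your own log values:

# Rough throughput estimate from the training log (placeholder values)
global_batch_size = 512       # "global_batch_size" from the log line
train_step_timing_s = 12.0    # observed "train_step_timing in s" value
samples_per_second = global_batch_size / train_step_timing_s
print(f"~{samples_per_second:.1f} samples/s")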

Pre-train using eight GPUs

Moving on, scale the job up to 8 GPUs for a single node. Edit the train.py script and change this line:

recipe = configure_recipe(nodes=1, gpus_per_node=1)

to this:

recipe = configure_recipe(nodes=1, gpus_per_node=8)

Save the file and run the script again with:

python3 train.py

This will launch the same job as before, but now on 8 GPUs. Check the job in the UI and compare the performance. Ideally, the train_step_timing in s should be approximately 8x lower (faster) than the single-GPU number from earlier.

Pre-train on two nodes

Finally, update the script to run on 2 nodes with 8 GPUs per node. Edit the train.py script and change this line:

recipe = configure_recipe(nodes=1, gpus_per_node=8)

to this:

recipe = configure_recipe(nodes=2, gpus_per_node=8)

Save the file and run the script again with:

python3 train.py

This will launch a third job on 16 total GPUs on 2 nodes. Check the latest job in the UI and compare the train_step_timing in s again. This should ideally be approximately 2x faster than the 8 GPU job run earlier.
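
To quantify how well the jobs scaled, compare the step timings of the three runs directly. The sketch below uses placeholder timings; substitute the train_step_timing in s values from your own logs:

# Compare step timings across the three runs (placeholder values)
step_time_1_gpu = 96.0     # from the 1 GPU run
step_time_8_gpu = 12.5     # from the 8 GPU run
step_time_16_gpu = 6.5     # from the 16 GPU run

speedup_single_node = step_time_1_gpu / step_time_8_gpu    # ideally close to 8x
speedup_two_nodes = step_time_8_gpu / step_time_16_gpu     # ideally close to 2x
print(f"1 -> 8 GPUs: {speedup_single_node:.1f}x, 8 -> 16 GPUs: {speedup_two_nodes:.1f}x")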

Clean Up

After completing all jobs, you can remove them in the UI by clicking the Delete button next to each job in the Batch Jobs page.

Next steps

If you made it this far, congratulations! Your DGX Cloud Lepton workspace should now be ready for distributed training jobs.

For an in-depth guide on training and fine-tuning LLMs with NeMo Framework, refer to our comprehensive guide.

Copyright © 2025, NVIDIA Corporation.