Getting Started with Training Steps
This page walks through one supervised fine-tuning (SFT) run on DGX Cloud Lepton using the tiny configuration.
The tiny configuration lives in src/nemotron/steps/sft/automodel/config/tiny.yaml and is meant for a short validation run before you scale up.
The goal is to validate end-to-end execution, NeMo Run, and your environment profile on real multi-node hardware.
Prerequisites
You need access to DGX Cloud Lepton with GPU nodes. This path assumes two nodes with eight A100 80 GB GPUs per node, matching the run.env block in src/nemotron/steps/sft/automodel/config/tiny.yaml.
You set the following environment variables:
HF_TOKEN
WANDB_API_KEY
NVIDIA_API_KEY
You ran lep login after synchronizing dependencies and are logged into Lepton.
The preceding list applies to the steps on this page. Refer to Limitations and Restrictions for information about supported environments.
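The credential prerequisite can be checked from the shell before you go further. The following is a minimal sketch, not part of the Nemotron CLI: the function name check_required_env is hypothetical, and only the three variable names come from this page.

```shell
# Hypothetical helper: report which of the three credentials named on this
# page are unset or empty in the current shell.
check_required_env() {
  missing=0
  for name in HF_TOKEN WANDB_API_KEY NVIDIA_API_KEY; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=$((missing + 1))
    fi
  done
  # The return status is the number of missing variables (0 means all set).
  return "$missing"
}

if check_required_env; then
  echo "all credentials are set"
else
  echo "export the missing variables before submitting"
fi
```

The check only confirms that each variable is non-empty. It does not verify that a token is valid for the corresponding service.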
Procedure
Clone the repository, if you haven’t already:
$ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
Sync the dependencies:
$ uv sync
Create an env.toml at the root of the repository like the following example:
[lepton_base]
executor = "lepton"
container_image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
node_group = "<lepton-node-group>"
nemo_run_dir = "/mnt/lustre-shared/<your-username>/experiments"
remote_job_dir = "/mnt/lustre-shared/<your-username>/output/functional"
workspace = "/mnt/lustre-shared"
ray_version = "2.48.0"
shared_memory_size = 65536
pip_extras = ["typer", "rich", "pydantic-settings"]
mounts = [
  { from = "node-nfs:<lepton-fileset-alias>", path = "<lepton-mount-source-path>", mount_path = "/mnt/lustre-shared" }
]
env_vars = { HF_TOKEN = "${oc.env:HF_TOKEN,''}", HF_HOME = "/mnt/lustre-shared/hf", WANDB_API_KEY = "${oc.env:WANDB_API_KEY,''}", WANDB_PROJECT = "<project>", NVIDIA_API_KEY = "${oc.env:NVIDIA_API_KEY,''}", RAY_DEDUP_LOGS = "0", RAY_GRAFANA_IFRAME_HOST = "" }

[lepton_sft_automodel]
extends = "lepton_base"
container_image = "nvcr.io/nvidia/nemo-automodel:26.04"
resource_shape = "gpu.8xa100-80gb"
nodes = 2
pip_extras = ["typer", "rich", "pydantic-settings", "omegaconf"]
Summary of the Config File
The [lepton_base] table defines cluster fundamentals that every profile inherits: the executor, the base container image, the node group, shared-storage paths, the Ray runtime version, the shared-memory size, the Python package extras the Nemotron CLI needs, the cluster mount, and an env_vars block whose ${oc.env:VAR,''} entries pull credentials from your shell at submit time. The [lepton_sft_automodel] table extends the base: it replaces the container image with the AutoModel image, sets the resource shape needed for full-parameter SFT and the node count, and lists the Python package extras the AutoModel runtime expects, which adds omegaconf to the base set.
Contact your cluster administrator for the values that replace the placeholders.
<lepton-node-group>: The Lepton node group identifier for the cluster you have access to.
<your-username>: A directory you own on the shared mount where NeMo Run records each experiment.
<lepton-fileset-alias>: The alias of the Lepton storage fileset that each container mounts.
<lepton-mount-source-path>: The host path the fileset exposes.
<project>: The Weights & Biases project name the run reports to.
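Before you submit, it can help to confirm that no placeholder is left in the file. The following is a minimal sketch that searches for lowercase tokens in angle brackets, which is the placeholder style used on this page. The function name find_placeholders is hypothetical and is not part of the Nemotron CLI.

```shell
# Hypothetical helper: print every line of the given file that still contains
# a <placeholder-style> token, prefixed with its line number.
# The exit status is 0 when at least one placeholder is found, 1 when none is.
find_placeholders() {
  grep -n -E '<[a-z][a-z-]*>' "$1"
}

# Check env.toml in the current directory, if it exists.
if [ -f env.toml ]; then
  if find_placeholders env.toml; then
    echo "replace the placeholders listed above before submitting" >&2
  else
    echo "env.toml has no placeholders left"
  fi
fi
```

The pattern does not match the ${oc.env:VAR,''} entries, so those lines are not reported.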
Export HF_TOKEN, WANDB_API_KEY, and NVIDIA_API_KEY in your shell before submitting; the env file pulls them in without writing the values to disk.
If you would rather generate a complete env file with every Nemotron training profile pre-wired, run the bundled environment profile generator instead of writing the file by hand.
$ uv run nemotron steps run env/env_toml -c lepton output_path=env.toml force=true
The generator emits every canonical profile the training steps expect, including data-prep, SFT, PEFT, RL, and pretrain variants.
View the step manifest and run specification:
$ uv run nemotron steps show sft/automodel
Example Output
──────────────────────────── sft/automodel — SFT Training (AutoModel) ────────────────────────────
~/nemotron/src/nemotron/steps/sft/automodel
Supervised fine-tuning with the AutoModel stack for HF-format models and JSONL datasets that already use OpenAI chat-format messages. Supports full SFT and LoRA-style adapter tuning from the same step.
Consumes
  • training_jsonl — Instruction data in JSONL with a messages field
Produces
  • checkpoint_hf — HuggingFace checkpoint directory (full model or adapter-style PEFT output)
Parameters
  • peft (default=null) — Use 'lora' for adapter tuning, or 'null' for full fine-tuning.
Runspec
  launcher: torchrun
  image: -
  resources: nodes=1 gpus_per_node=4
  config dir: ~/nemotron/src/nemotron/steps/sft/automodel/config
  default config: default
Compile the job against your Lepton profile without submitting it. The profile name lepton_sft_automodel must match a table in your root env.toml.
$ uv run nemotron steps run sft/automodel --config tiny --run lepton_sft_automodel --dry-run
Partial Output
Compiled Configuration
╭───────────────────────────────── run ──────────────────────────────────╮
│ env:                                                                   │
│   nodes: 2                                                             │
│   gpus_per_node: 8                                                     │
│   nprocs_per_node: 8                                                   │
│   executor: lepton                                                     │
│   container_image: nvcr.io/nvidia/nemo-automodel:26.04                 │
│   node_group: az-sat-lepton-001                                        │
│   resource_shape: gpu.8xa100-80gb                                      │
│   remote_job_dir: /mnt/lustre-shared/user/nemotron/.nemotron-jobs      │
...
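The dry-run output can also be checked mechanically. The following is a minimal sketch that reads the compiled configuration on standard input and looks for the fields this page expects for the two-node profile. The function name check_dry_run is hypothetical, and the sketch assumes the CLI prints the same key: value text when its output is piped.

```shell
# Hypothetical helper: read dry-run output on stdin and confirm that it
# contains the executor, node count, and GPU count shown on this page.
check_dry_run() {
  output=$(cat)
  for expected in "executor: lepton" "nodes: 2" "gpus_per_node: 8"; do
    case "$output" in
      *"$expected"*) ;;  # substring found, keep checking
      *)
        echo "not found in dry-run output: $expected" >&2
        return 1
        ;;
    esac
  done
  echo "dry-run output matches the two-node profile"
}
```

To use it, pipe the dry-run command from this step into check_dry_run. The match is a plain substring test, so treat a pass as a quick sanity check, not as full validation of the compiled job.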
Submit the sample SFT job:
$ uv run nemotron steps run sft/automodel -c tiny -r lepton_sft_automodel
The sample tiny config sets small training and validation splits.
To specify the output path for checkpoints, set SFT_OUTPUT_DIR before running or specify the checkpoint.checkpoint_dir CLI override.
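The two options can be sketched as follows. The checkpoint directory is a hypothetical path on the shared mount, and the key=value form of the override is an assumption modeled on the generator command on this page (output_path=env.toml force=true). The sketch only prints the commands so that you can review them before running one.

```shell
# Hypothetical checkpoint location; replace it with a directory you own.
export SFT_OUTPUT_DIR="/mnt/lustre-shared/<your-username>/checkpoints/sft-tiny"

# The submit command from this page.
submit_cmd="uv run nemotron steps run sft/automodel -c tiny -r lepton_sft_automodel"

# Option 1: rely on the exported SFT_OUTPUT_DIR variable.
echo "with env var:  $submit_cmd"

# Option 2: pass the checkpoint directory as a CLI override.
# The key=value syntax here is an assumption; confirm it for your CLI version.
echo "with override: $submit_cmd checkpoint.checkpoint_dir=$SFT_OUTPUT_DIR"
```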
Success Checks
The command nemotron steps show <step_id> lists consumes and produces artifact types. Those types must line up with your pipeline when you chain steps.
A finished sample run leaves logs and job metadata where NeMo Run is configured to write them. See Execution through NeMo Run for experiment layout.
If you change the tokenizer, template, or sequence length, keep them consistent across every step that touches the same model line. The Artifact Graph page explains why consistency matters.
Next Steps
Follow Run SFT with AutoModel on Custom Data when you need to point tiny.yaml at your own data or change the base model.
Read Choose an SFT Backend when you need Megatron Bridge instead of AutoModel.