sft/megatron_bridge#

This step runs supervised fine-tuning (SFT) on a Megatron checkpoint by using NVIDIA Megatron-Bridge. It supports tensor, pipeline, and context parallelism for large-scale distributed training of the Nemotron model family. The step consumes packed Apache Parquet shards produced by data_prep/sft_packing.

Syntax#

nemotron steps run sft/megatron_bridge \
    [-c <config-name-or-path>] \
    [-r <run-profile> | -b <batch-profile>] \
    [-d] \
    [--force-squash] \
    [<dotlist-overrides>...] \
    [<passthrough-args>...]

See the Nemotron Steps CLI Reference for the shared flag set.

Configuration Files#

The step ships three configuration files under src/nemotron/steps/sft/megatron_bridge/config/.

File

Purpose

default.yaml

Two-node Slurm functional-test configuration. Loads base weights from Hugging Face via AutoBridge with LoRA enabled (peft: lora). Not the programmatic default.

nano3.yaml

Production full-SFT configuration for the Nano3 model (peft: null, 1700 training iterations). This is the programmatic default loaded when no -c flag is specified.

tiny.yaml

Short validation run against packed Parquet shards on a two-node Lepton profile.

Pass the configuration name with -c:

$ nemotron steps run sft/megatron_bridge -c tiny
$ nemotron steps run sft/megatron_bridge -c default

Inputs and Outputs#

Direction

Artifact Type

Required

Description

Consumes

packed_parquet

Yes

Packed SFT Parquet shards with input_ids and loss_mask columns. Produce these shards with data_prep/sft_packing first.

Consumes

checkpoint_megatron

No

A pretrained Megatron checkpoint or a prior Megatron SFT checkpoint. When this input is absent, the step loads weights from the Hugging Face model declared by hf_model_path.

Produces

checkpoint_megatron

A fine-tuned Megatron distributed checkpoint.

Supported Models#

Model

Minimum GPUs

Default

Notes

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

8

Yes

Nemotron 3 Nano with 31.6 billion total and 3.2 billion active parameters. This model is the default Nano3 path.

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

32

No

Nemotron 3 Super with 120.6 billion total and 12.7 billion active parameters. Typical runs start at 32 GPUs.

Step Parameters#

The manifest declares two Nemotron-specific parameters. Pass them as dotlist overrides.

seq_length=N#

The training sequence length. This value must match the pack_size you used in data_prep/sft_packing.

Choices: 2048, 4096, 8192, 16384, 32768.

Default: 4096.

Example: seq_length=8192

peft=VALUE#

Selects low-rank adaptation (LoRA) tuning instead of full SFT. Set this value to lora for adapter tuning, or to null for full fine-tuning when the model and optimizer states fit in memory.

Choices: lora, null.

Default: null (as set in nano3.yaml, the programmatic default config). The default.yaml functional-test config sets this to lora.

Example: peft=null

Frequently used dotlist overrides drawn from the underlying recipe include the following.

hf_model_path=<id-or-path>#

The Hugging Face identifier or local path used to load base weights through AutoBridge when no Megatron checkpoint is supplied.

Example: hf_model_path=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

recipe.tensor_model_parallel_size=N#

The tensor-model-parallel degree applied by the Nano3 finetune recipe.

Example: recipe.tensor_model_parallel_size=8

recipe.pipeline_model_parallel_size=N#

The pipeline-model-parallel degree applied by the Nano3 finetune recipe.

Example: recipe.pipeline_model_parallel_size=4

train.global_batch_size=N#

The global batch size for the training loop.

Example: train.global_batch_size=128

checkpoint.save=PATH#

The directory where the Megatron-Bridge recipe writes checkpoints.

Example: checkpoint.save=/lustre/runs/nano3-sft/checkpoints

Strategies#

The manifest records the following operator strategies for sft/megatron_bridge.

  • When the dataset has fewer than ten thousand records, lower train.global_batch_size and raise the number of training iterations to keep optimizer steps useful.

  • When the operator wants LoRA tuning, set peft=lora to lower the GPU requirement and shrink the checkpoint footprint.

  • When the operator selects the Super3 model, start from a 32-GPU plan with tp=8, pp=4, cp=1, and verify cluster topology before scaling further.

  • When seq_length > 32768, enable hybrid context parallelism.

  • When GPU memory is tight, such as on A100 40 GB hardware, enable activation checkpointing and consider central-processing-unit (CPU) offloading.

  • When you want maximum throughput on H100 hardware, keep packed sequences enabled and tune overlap and sequence-packing settings before scaling up.

Common Errors#

tokenizer_mismatch#

Cause: the tokenizer used during data_prep/sft_packing differs from the tokenizer used for training, so token identifiers do not align.

Recovery: set the data_prep/sft_packing tokenizer to match the training model and regenerate the packed Parquet shards.

oom#

Cause: GPU memory is exhausted during forward, backward, or optimizer steps.

Recovery: reduce train.global_batch_size, increase parallelism, or reduce seq_length.

missing_packed_data#

Cause: the training loop cannot find packed Parquet shards at the configured dataset.nano3_packed_sft_dir.

Recovery: run data_prep/sft_packing first, or override dataset.nano3_packed_sft_dir to point at the directory that holds the packed splits.

Command Examples#

Run the tiny validation configuration on the two-node Lepton SFT profile:

$ nemotron steps run sft/megatron_bridge -c tiny -r lepton_sft_megatron_bridge

Compile the default configuration without submitting the job:

$ nemotron steps run sft/megatron_bridge -c default --dry-run

Submit a detached LoRA run on Slurm with a longer sequence length:

$ nemotron steps run sft/megatron_bridge -c default -b slurm_sft_megatron_bridge \
    peft=lora \
    seq_length=8192 \
    train.global_batch_size=256

Submit an attached run on the Super3 base model with eight-way tensor parallelism and four-way pipeline parallelism:

$ nemotron steps run sft/megatron_bridge -c default -r lepton_sft_megatron_bridge \
    hf_model_path=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    recipe.tensor_model_parallel_size=8 \
    recipe.pipeline_model_parallel_size=4