Stage 1: Supervised Fine-Tuning (SFT)#

This stage fine-tunes the pretrained Nemotron 3 Super model for instruction following using Megatron-Bridge.


Training Methodology#

Chat Template: The chat template is identical to Nemotron 3 Nano.

Training Framework: SFT is implemented using Megatron-Bridge’s finetune() entry point, which loads a pretrained checkpoint and handles the training loop with role-based loss masking. See Training Entry Points for implementation details.

Two-Stage SFT Loss#

Nemotron 3 Super uses a novel two-stage SFT loss procedure. The model is supervised only on output (assistant) tokens, with prompt tokens masked.

Stage 1: Token-level (global) average. The loss averages over all output tokens in the packed global batch:

\[ \mathcal{L}_{\text{tok}} = \frac{\sum_{c \in \mathcal{B}} \sum_{t \in \mathcal{O}_c} \ell_t}{\sum_{c \in \mathcal{B}} |\mathcal{O}_c|} \]

This corresponds to summing output-token log probabilities across all conversations and normalizing by the total number of output tokens.

Stage 2: Sample-level average. The loss then switches to a per-conversation normalized loss averaged equally across conversations:

\[ \mathcal{L}_{\text{samp}} = \frac{1}{|\mathcal{B}|} \sum_{c \in \mathcal{B}} \left( \frac{1}{|\mathcal{O}_c|} \sum_{t \in \mathcal{O}_c} \ell_t \right) \]

This stage reduces the dominance of long outputs by normalizing each conversation by its own output-token count before averaging across the batch.

The switch from token-level to sample-level loss is configured in the recipe YAML (config/default.yaml).

MTP During SFT#

The shared-weight MTP head from pretraining is continued during SFT to preserve both the accuracy benefits of multi-step prediction and the inference-time gains from speculative decoding. Two MTP layers with shared parameters are trained using a scaled auxiliary loss (MTP loss scaling factor 0.3) computed with per-token loss.

Data Preparation Pipeline#

Before training, chat conversations are transformed into training-ready sequences through several stages:

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
    subgraph prep["Data Preparation"]
        direction LR
        chat["OpenAI Chat<br/>Format"] --> template["Chat<br/>Template"]
        template --> chunks["Role-Labeled<br/>Chunks"]
        chunks --> tok["Tokenization"]
        tok --> mask["Loss Mask<br/>(role-based)"]
        mask --> pack["Packing"]
        pack --> roll["Mask Rolling"]
    end
    roll --> npy["Parquet Output"]

    style chat fill:#e3f2fd,stroke:#2196f3
    style template fill:#e3f2fd,stroke:#2196f3
    style chunks fill:#e3f2fd,stroke:#2196f3
    style tok fill:#f3e5f5,stroke:#9c27b0
    style mask fill:#f3e5f5,stroke:#9c27b0
    style pack fill:#fff3e0,stroke:#ff9800
    style roll fill:#fff3e0,stroke:#ff9800
    style npy fill:#e8f5e9,stroke:#4caf50
    

Stage

What Happens

OpenAI Chat Format

Input messages with role (system/user/assistant) and content fields

Chat Template

Renders messages using the Jinja chat template (identical to Nemotron 3 Nano)

Role-Labeled Chunks

Splits rendered text back into chunks, each tagged with its source role

Tokenization

Converts text chunks to token IDs

Loss Mask

Builds mask: 1 for assistant tokens, 0 for system/user tokens

Packing

Multiple sequences packed into fixed-length bins (4096 tokens)

Mask Rolling

Shifts mask by 1 position for next-token prediction alignment

For data preparation implementation, see Recipe Source: src/nemotron/recipes/super3/stage1_sft/data_prep.py

Loss Masking#

Loss masking determines which tokens contribute to the training loss. In SFT, we only want the model to learn to generate responsesβ€”not to predict prompts or system instructions.

Role

Loss Mask

Training Signal

system

0

Ignored (instructions)

user

0

Ignored (prompts)

assistant

1

Learned (responses)

Packed Sequences#

Individual chat conversations vary in length. Packing concatenates multiple conversations into a single fixed-length sequence (default 4096 tokens), maximizing GPU utilization.

The packed sequence format stores everything Megatron-Bridge needs for training:

Field

Description

input_ids

Concatenated token IDs from multiple conversations

loss_mask

Rolled mask indicating which positions contribute to loss

seq_start_id

Boundary indices marking where each original conversation starts within the pack

Megatron-Bridge uses seq_start_id boundaries for variable-length attention (preventing cross-conversation attention leak) and FlashAttention optimization.

SFT Data Domains#

SFT data blend distribution

The SFT dataset covers 15+ domains across 7M total samples:

Reused from Nano3#

Domain

Description

Chat

General conversational data

Infinibyte

Cross-domain synthesis

Formal Proofs

Mathematical proof generation

Refreshed with New Teachers (DeepSeek v3.2, Kimi K2)#

Domain

Description

Competition Math

Competitive mathematics problems

Competition Code

Competitive programming problems

Conversational Tool Use

Multi-turn tool-using interactions (fully rebuilt pipeline: scaled from 5 domains/15K conversations in Nano3 to 838 domains/279K conversations)

Multilingual

Translations to 6 languages (DE, ES, FR, IT, JA, ZH) with format compliance post-editing

Science

Scientific reasoning

New Domains#

Domain

Description

Scale

Software Engineering

Coding tasks from GitHub issues (SWE-Gym, R2E-Gym, SWE-rebench) distilled via OpenHands with Qwen3-Coder-480B

β€”

Agentic Programming

Agentic CLI tasks: solution synthesis, SWE tasks, web development across Codex/OpenCode/Qwen Code CLI harnesses

~28K tasks

Long Context

Multi-document QA at 128K–512K context with multi-hop reasoning (4–7 retrieval steps) + 7 synthetic sequential reasoning tasks

β€”

Financial Reasoning

Template-based SDG from SecQue benchmark across S&P 500 companies and fiscal years 2019–2024

366K QA pairs

CUDA

Kernel generation, repair, and optimization from DeepSeek-R1/GPT-OSS-120B with CUDA evaluation validation

100K samples

Safety

Diverse prompts (content safety, jailbreak, over-safety, bias, prompt injection, copyright) with deliberative alignment reasoning traces

β€”

Search

Multi-hop search agent trajectories grounded in Wikidata knowledge graph walks (4–8 hops, ~12 tool calls per trajectory)

~7K records

Terminal Use

Terminal skill taxonomy with synthetic + adapted tasks, using DeepSeek-V3.2 in Dockerized agentic execution loop

84K samples

SQL

Text-to-SQL across MySQL/PostgreSQL/SQLite, 60 industries, ~700 topics, 90 SQL concept buckets

96.5K records

Reasoning Control#

Nemotron 3 Super supports three reasoning modes:

Mode

Description

Reasoning-off

No reasoning traces (3% of samples have reasoning stripped)

Regular reasoning

Standard chain-of-thought reasoning

Low-effort reasoning

New: shorter reasoning traces generated by GPT-OSS-120B low-effort mode (2% of SFT data)

Both regular and low-effort modes support inference-time budget control. After the main SFT stage, a short semi-on-policy SFT stage (350 steps, where rollouts are collected from the current model checkpoint) fine-tunes budget control by collecting rollouts and truncating 12% of reasoning traces to random reasoning budgets.

Hyperparameters#

Full SFT (default)#

Parameter

Value

Learning Rate

1e-5 (constant after warmup)

LR Warmup

30,000 samples (linear ramp to constant LR)

Sequence Packing

Up to 256K context

Global Batch Size

64

Micro-Batch Size

1

Pack Size

4096 tokens

Loss Masking

Role-based (assistant tokens only)

Loss Normalization

Two-stage (token-level then sample-level)

Optimizer

AdamW (beta1=0.9, beta2=0.95)

Weight Decay

0.1

Precision

BF16 mixed

MTP Loss Scaling

0.3

LoRA Fine-Tuning#

Parameter

Value

Learning Rate

1e-4

Target Modules

linear_qkv, linear_proj, linear_fc1, linear_fc2, in_proj, out_proj

Parallelism

TP=1, EP=1

Troubleshooting#

Common data preparation errors and solutions:

Error

Cause

Solution

Empty sequences after processing

All tokens masked (no assistant content)

Verify input data contains assistant responses

Template rendering mismatch

Tokenizer BPE splits differ from template expectations

Ensure tokenizer model matches the one used during template creation

Sequences truncated excessively

Many conversations exceed max_doc_tokens

Consider increasing max_doc_tokens or pack_size

Debugging tips:

  • Use --sample 100 to test data preparation on a small subset

  • Check metadata.json output for statistics on filtered/truncated sequences

  • Review W&B artifacts for lineage tracking and validation metrics


Recipe Execution#

Quick Start#

// 1. Prepare data (apply chat templates, tokenize to Packed Parquet)
$ uv run nemotron super3 data prep sft --run YOUR-CLUSTER

// 2. Run SFT
$ uv run nemotron super3 sft --run YOUR-CLUSTER

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.

Direct Script Execution (Megatron-Bridge)#

For direct execution outside this CLI, use the scripts in the Megatron-Bridge repository:

# Clone the repository and checkout the super-v3 branch
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout super-v3

# Full-parameter SFT (inside container on compute node)
torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
    logger.wandb_project=your_project \
    logger.wandb_entity=nvidia \
    logger.log_interval=5 \
    checkpoint.save=/path/to/checkpoints \
    checkpoint.load=/path/to/checkpoints \
    checkpoint.pretrained_checkpoint=/path/to/pretrained/ckpt \
    checkpoint.save_interval=50 \
    train.global_batch_size=16 \
    train.train_iters=200 \
    scheduler.lr_warmup_iters=10 \
    model.tensor_model_parallel_size=4 \
    model.sequence_parallel=True

# LoRA fine-tuning
torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
    --peft lora \
    checkpoint.pretrained_checkpoint=/path/to/pretrained/ckpt \
    train.global_batch_size=4 \
    train.train_iters=200 \
    model.tensor_model_parallel_size=4 \
    model.context_parallel_size=2 \
    model.sequence_parallel=True

See the Megatron-Bridge Nemotron 3 Super documentation for detailed configuration options.

Configuration#

File

Purpose

config/default.yaml

Production configuration (full SFT)

config/tiny.yaml

Quick testing configuration

config/data_prep/default.yaml

Data preparation settings

Data Preparation#

The data_prep.py script processes OpenAI-format chat data into packed sequences with role-based loss masking. See Data Preparation Module for detailed documentation.

CLI Command#

uv run nemotron super3 data prep sft [options]

Option

Description

--run <profile>

Execute on Slurm via NeMo-Run

--sample N

Limit rows per dataset (for testing)

--force

Force re-run, ignoring cache

Output#

output/stage1_sft/
β”œβ”€β”€ blend.json
β”œβ”€β”€ splits/
β”‚   β”œβ”€β”€ train/
β”‚   β”‚   β”œβ”€β”€ shard_000000.parquet
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ valid/
β”‚   └── test/
└── runs/{run_hash}/
    └── datasets/{name}/{hash}/

The output is registered as a W&B Artifact (SFTDataArtifact-sft) for lineage tracking.

Training#

CLI Command#

uv run nemotron super3 sft [options] [overrides...]

Option

Description

--run <profile>

Attachedβ€”submits and waits, streaming logs (NeMo-Run)

--batch <profile>

Detachedβ€”submits and exits immediately (NeMo-Run)

--dry-run

Preview execution plan

key=value

Override config values (NeMo-Run)

Override Examples#

# More training iterations
uv run nemotron super3 sft train.train_iters=5000

# Different learning rate
uv run nemotron super3 sft optimizer.lr=1e-5

# Load specific pretrained checkpoint
uv run nemotron super3 sft checkpoint.pretrained_checkpoint=/path/to/checkpoint

Running with NeMo-Run#

Configure execution profiles in env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

See Execution through NeMo-Run for complete configuration options.

Artifact Lineage#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    prev["ModelArtifact-pretrain<br/>(from Stage 0)"] --> train
    inst["Instruction Datasets<br/>(OpenAI chat format)"] --> dp["data_prep.py"]
    dp --> data["SFTDataArtifact-sft<br/>(Packed Parquet)"]
    data --> train["train.py"]
    train --> model["ModelArtifact-sft<br/>(fine-tuned checkpoint)"]

    style prev fill:#e1f5fe,stroke:#2196f3
    style inst fill:#f3e5f5,stroke:#9c27b0
    style dp fill:#f3e5f5,stroke:#9c27b0
    style data fill:#f3e5f5,stroke:#9c27b0
    style train fill:#f3e5f5,stroke:#9c27b0
    style model fill:#f3e5f5,stroke:#9c27b0
    

Infrastructure#

This stage uses the following components from the NVIDIA AI Stack:

Component

Role

Documentation

Megatron-Core

Distributed training primitives (TP, PP, DP, EP)

GitHub

Megatron-Bridge

Fine-tuning loop, checkpoint loading, loss masking

Docs

Key Features Used#

Feature

Purpose

finetune() entry point

SFT training with pre-loaded checkpoint

Role-based loss masking

Only compute loss on assistant tokens

Two-stage loss

Token-level then sample-level normalization

Mixed precision (BF16)

Memory-efficient training

Packed Parquet sequences

Efficient variable-length sequence handling (up to 256K)

Multi-token prediction

Continued MTP training during SFT (loss scaling 0.3)

Parallelism Configuration#

Full SFT (default, peft: null)#

Parallelism

Default

Config Key

Tensor (TP)

1

model.tensor_model_parallel_size

Pipeline (PP)

1

model.pipeline_model_parallel_size

Expert (EP)

8

model.expert_model_parallel_size

Expert Tensor (ETP)

1

model.expert_tensor_parallel_size

Sequence (SP)

Yes

model.sequence_parallel

Data (DP)

Auto

Computed from world size

LoRA (peft: lora)#

Parallelism

Default

Config Key

Tensor (TP)

1

model.tensor_model_parallel_size

Pipeline (PP)

1

model.pipeline_model_parallel_size

Expert (EP)

1

model.expert_model_parallel_size

Sequence (SP)

Yes

model.sequence_parallel

Minimum resources: 4 nodes with 8 GPUs each (32 GPUs total).

Container#

nvcr.io/nvidia/nemo:26.02.nemotron_3_super

Next Steps#

After SFT completes, proceed to Stage 2: RL for reinforcement learning alignment.

Reference#