distill#

This step runs teacher-student distillation with NVIDIA Model Optimizer through NVIDIA Megatron-Bridge. It can run as a standalone training job or as a quality-recovery pass after pruning or quantization. Runs on real data consume the Megatron bin/idx data that data_prep/pretrain_prep produces. The output is a distilled Megatron distributed checkpoint.

Syntax#

nemotron steps run optimize/modelopt/distill \
    [-c <config-name-or-path>] \
    [-r <run-profile> | -b <batch-profile>] \
    [-d] \
    [--force-squash] \
    [<dotlist-overrides>...] \
    [<passthrough-args>...]

Refer to the Nemotron Steps CLI Reference for the shared flag set.

Configuration Files#

The step ships two configuration files under src/nemotron/steps/optimize/modelopt/distill/config/.

| File | Purpose |
| --- | --- |
| `default.yaml` | Generic teacher-student distillation configuration with `Qwen/Qwen3-8B` as the teacher and `Qwen/Qwen3-4B` as the student. |
| `tiny.yaml` | Short validation run that exercises the distillation pipeline with mock data. |

Pass the configuration name with -c:

$ nemotron steps run optimize/modelopt/distill -c tiny
$ nemotron steps run optimize/modelopt/distill -c default

Inputs and Outputs#

| Direction | Artifact Type | Required | Description |
| --- | --- | --- | --- |
| Consumes | `checkpoint_hf` | Yes | The teacher and student Hugging Face (HF) checkpoints. |
| Consumes | `binidx` | No | Optional real distillation data from `data_prep/pretrain_prep`. This input is unnecessary when `args.use_mock_data=true`. |
| Produces | `checkpoint_megatron` | — | The distilled Megatron distributed checkpoint. |
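Before a real-data run, it can help to confirm that each dataset prefix actually has its data files. The sketch below assumes the usual Megatron indexed-dataset layout, in which a prefix such as /lustre/data/wiki refers to the pair /lustre/data/wiki.bin and /lustre/data/wiki.idx. The check_binidx_prefix function is a hypothetical helper, not part of the nemotron CLI.

```shell
# check_binidx_prefix is a hypothetical helper, not a nemotron command.
# It assumes a prefix P refers to the file pair P.bin and P.idx.
# Returns 0 when both files exist, 1 otherwise.
check_binidx_prefix() {
    local prefix="$1" status=0 ext
    for ext in bin idx; do
        if [ ! -f "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}" >&2
            status=1
        fi
    done
    return "$status"
}
```

A call such as check_binidx_prefix /lustre/data/wiki returns a non-zero status and names the missing file on standard error when either half of the pair is absent.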

Step Parameters#

The manifest declares five distillation parameters. Pass them as dotlist overrides.

args.teacher_hf_path=<id-or-path>#

The Hugging Face identifier or local path for the teacher checkpoint.

Example: args.teacher_hf_path=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

args.student_hf_path=<id-or-path>#

The Hugging Face identifier or local path for the student checkpoint.

Example: args.student_hf_path=Qwen/Qwen3-4B

args.data_paths=<list>#

The Megatron data blend, written as a flat list that alternates weights and dataset path prefixes. This matches the upstream command-line form [weight, prefix, weight, prefix, ...].

Example: args.data_paths='[0.5, /lustre/data/wiki, 0.5, /lustre/data/c4]'
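For blends with many datasets, assembling the list by hand is error-prone. The following is a minimal sketch of a helper that builds the [weight, prefix, ...] string from alternating arguments. The build_blend name is hypothetical and is not part of the nemotron CLI.

```shell
# build_blend is a hypothetical helper, not a nemotron command.
# Usage: build_blend <weight> <prefix> [<weight> <prefix> ...]
# Prints the blend in the form [weight, prefix, weight, prefix, ...].
build_blend() {
    # Require a non-empty, even number of arguments (weight/prefix pairs).
    if [ "$#" -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "build_blend: expected weight/prefix pairs" >&2
        return 1
    fi
    local out="" sep=""
    while [ "$#" -gt 0 ]; do
        out="${out}${sep}$1, $2"
        sep=", "
        shift 2
    done
    printf '[%s]\n' "$out"
}

build_blend 0.5 /lustre/data/wiki 0.5 /lustre/data/c4
```

The result can then be passed as one quoted override, for example args.data_paths="$(build_blend 0.5 /lustre/data/wiki 0.5 /lustre/data/c4)".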

args.use_mock_data=<bool>#

When set to true, the step runs a validation pass with mock data instead of real Megatron bin/idx data.

Default: false.

Example: args.use_mock_data=true

extra_args=<list>#

Literal upstream arguments that the step forwards to the distillation script. Use this parameter to pass newly added Model Optimizer flags that do not yet have a dedicated args.* entry.

Default: [].

Example: extra_args=["--hf_export_path", "/lustre/distilled/hf"]
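When extra_args is typed on a command line, the shell processes the spaces and double quotes before the CLI sees them. The sketch below uses a hypothetical count_args helper to show how quoting changes what arrives. It assumes the CLI expects each dotlist override as a single argument, as in the single-quoted args.data_paths example.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the override at the space and strips the double quotes.
unquoted=$(count_args extra_args=["--hf_export_path", "/lustre/distilled/hf"])

# Single-quoted: the whole override, inner double quotes included, stays one argument.
quoted=$(count_args 'extra_args=["--hf_export_path", "/lustre/distilled/hf"]')

echo "unquoted=${unquoted} quoted=${quoted}"
```

In bash, the unquoted form arrives as two arguments and the single-quoted form as one, so wrapping list-valued overrides in single quotes is the safer habit.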

Frequently used dotlist overrides drawn from the default configuration include the following.

args.tp_size=<n>#

The tensor-parallel degree applied during distillation.

Example: args.tp_size=4

args.train_iters=<n>#

The number of training iterations.

Example: args.train_iters=2000

args.seq_length=<n>#

The training sequence length.

Example: args.seq_length=4096

Strategies#

The manifest records three operator strategies for optimize/modelopt/distill.

- When you recover quality after pruning or quantization, set the original BF16 checkpoint as the teacher and the optimized checkpoint as the student.
- When you validate the pipeline, set args.use_mock_data=true, args.seq_length=512, args.train_iters=100, and a small args.eval_iters value.
- When you need a Hugging Face checkpoint, set args.hf_export_path and args.student_hf_model, or convert a saved Megatron iteration after the run completes.
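The quality-recovery strategy can be sketched as a command line. The helper below only prints the command so that it can be reviewed before submission. The recovery_cmd name and both checkpoint paths are hypothetical, and using the default configuration here is an assumption.

```shell
# recovery_cmd is a hypothetical helper, not a nemotron command.
# It prints (does not run) a distill command that uses the original BF16
# checkpoint as the teacher and the pruned or quantized checkpoint as the student.
recovery_cmd() {
    local original="$1" optimized="$2"
    printf 'nemotron steps run optimize/modelopt/distill -c default args.teacher_hf_path=%s args.student_hf_path=%s\n' \
        "$original" "$optimized"
}

# Placeholder paths for illustration only.
recovery_cmd /ckpt/original-bf16 /ckpt/pruned-hf
```

Printing first keeps the teacher and student roles easy to check, because swapping the two paths would distill in the wrong direction.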

Command Examples#

Run the tiny validation configuration locally with mock data:

$ nemotron steps run optimize/modelopt/distill -c tiny

Compile the default configuration without submitting the job:

$ nemotron steps run optimize/modelopt/distill -c default --dry-run

Submit an attached distillation run on a Lepton profile with real data:

$ nemotron steps run optimize/modelopt/distill -c default -r lepton_optimize_modelopt_distill \
    args.teacher_hf_path=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    args.student_hf_path=Qwen/Qwen3-4B \
    args.data_paths='[0.5, /lustre/data/wiki, 0.5, /lustre/data/c4]' \
    args.train_iters=2000

Submit a detached pipeline-validation run on a Slurm profile with mock data:

$ nemotron steps run optimize/modelopt/distill -c default -b slurm_optimize_modelopt_distill \
    args.use_mock_data=true \
    args.seq_length=512 \
    args.train_iters=100