optimize/modelopt/distill#
This step runs teacher-student distillation by using NVIDIA Model Optimizer through NVIDIA Megatron-Bridge.
The step can run as a standalone training job, or as a quality-recovery pass after pruning or quantization.
Real-data runs consume Megatron bin/idx data produced by data_prep/pretrain_prep.
The step produces a distilled Megatron distributed checkpoint.
Syntax#
nemotron steps run optimize/modelopt/distill \
[-c <config-name-or-path>] \
[-r <run-profile> | -b <batch-profile>] \
[-d] \
[--force-squash] \
[<dotlist-overrides>...] \
[<passthrough-args>...]
Refer to the Nemotron Steps CLI Reference for the shared flag set.
Configuration Files#
The step ships two configuration files under src/nemotron/steps/optimize/modelopt/distill/config/.
| File | Purpose |
|---|---|
| | Generic teacher-student distillation configuration with |
| | Short validation run that exercises the distillation pipeline with mock data. |
Pass the configuration name with -c:
$ nemotron steps run optimize/modelopt/distill -c tiny
$ nemotron steps run optimize/modelopt/distill -c default
Inputs and Outputs#
| Direction | Artifact Type | Required | Description |
|---|---|---|---|
| Consumes | | Yes | The teacher and student Hugging Face (HF) checkpoints. |
| Consumes | | No | Optional real distillation data from |
| Produces | | - | The distilled Megatron distributed checkpoint. |
Step Parameters#
The manifest declares five distillation parameters. Pass them as dotlist overrides.
- args.teacher_hf_path=<id-or-path>#
The Hugging Face identifier or local path for the teacher checkpoint.
Example:
args.teacher_hf_path=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- args.student_hf_path=<id-or-path>#
The Hugging Face identifier or local path for the student checkpoint.
Example:
args.student_hf_path=Qwen/Qwen3-4B
- args.data_paths=<list>#
The Megatron data blend, expressed as the upstream command-line sequence in the form [weight, prefix, weight, prefix, ...].
Example:
args.data_paths='[0.5, /lustre/data/wiki, 0.5, /lustre/data/c4]'
- args.use_mock_data=<bool>#
When set to true, the step runs a validation pass with mock data instead of real Megatron bin/idx data.
Default: false.
Example:
args.use_mock_data=true
- extra_args=<list>#
Literal upstream arguments that the step forwards to the distillation script. Use this parameter to pass newly added Model Optimizer flags that do not yet have a dedicated args.* entry.
Default: [].
Example:
extra_args=["--hf_export_path", "/lustre/distilled/hf"]
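The weight/prefix layout of args.data_paths is easy to get wrong when a blend has many entries. The following is a minimal Python sketch, not part of the step itself, that renders (weight, prefix) pairs into the flat sequence shown above. The helper name format_data_blend is hypothetical, and the rejection of non-positive weights is this sketch's own sanity check rather than a documented rule.

```python
from typing import Iterable, Tuple


def format_data_blend(pairs: Iterable[Tuple[float, str]]) -> str:
    """Render (weight, prefix) pairs as '[w1, p1, w2, p2, ...]'."""
    items = []
    for weight, prefix in pairs:
        # Assumption of this sketch: a blend weight should be positive.
        if weight <= 0:
            raise ValueError(f"weight must be positive, got {weight}")
        items.append(f"{weight:g}")  # compact float formatting, e.g. 0.5
        items.append(prefix)
    if not items:
        raise ValueError("at least one (weight, prefix) pair is required")
    return "[" + ", ".join(items) + "]"


blend = format_data_blend([
    (0.5, "/lustre/data/wiki"),
    (0.5, "/lustre/data/c4"),
])
# Single quotes keep the shell from splitting the list on spaces.
print(f"args.data_paths='{blend}'")
```

The printed token can be pasted directly after the step name on the command line.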
Frequently used dotlist overrides drawn from the default configuration include the following.
- args.tp_size=<n>#
The tensor-parallel degree applied during distillation.
Example:
args.tp_size=4
- args.train_iters=<n>#
The number of training iterations.
Example:
args.train_iters=2000
- args.seq_length=<n>#
The training sequence length.
Example:
args.seq_length=4096
Strategies#
The manifest records three operator strategies for optimize/modelopt/distill.
When you recover quality after pruning or quantization, set the original BF16 checkpoint as the teacher and the optimized checkpoint as the student.
When you validate the pipeline, set args.use_mock_data=true, args.seq_length=512, args.train_iters=100, and a small args.eval_iters value.
When you need a Hugging Face checkpoint, set args.hf_export_path and args.student_hf_model, or convert a saved Megatron iteration after the run completes.
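The pipeline-validation strategy can be captured as data and rendered into dotlist tokens, which helps when the same overrides are reused across scripts. The sketch below is illustrative only: to_dotlist is a hypothetical helper, and args.eval_iters=2 is a placeholder for the "small value" the strategy mentions, not a recommended setting.

```python
from typing import Dict, List, Union

Value = Union[bool, int, float, str]


def to_dotlist(overrides: Dict[str, Value]) -> List[str]:
    """Render a mapping as key=value tokens, with booleans in lowercase."""
    tokens = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            # str(True) would give 'True'; this page writes booleans in lowercase.
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        tokens.append(f"{key}={rendered}")
    return tokens


# Pipeline-validation strategy; the eval_iters value is a placeholder.
validation = to_dotlist({
    "args.use_mock_data": True,
    "args.seq_length": 512,
    "args.train_iters": 100,
    "args.eval_iters": 2,
})
print(" ".join(validation))
```

Keeping the overrides in a dictionary makes it straightforward to swap in the quality-recovery settings (teacher and student paths) without retyping the command.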
Command Examples#
Run the tiny validation configuration locally with mock data:
$ nemotron steps run optimize/modelopt/distill -c tiny
Compile the default configuration without submitting the job:
$ nemotron steps run optimize/modelopt/distill -c default --dry-run
Submit an attached distillation run on a Lepton profile with real data:
$ nemotron steps run optimize/modelopt/distill -c default -r lepton_optimize_modelopt_distill \
args.teacher_hf_path=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
args.student_hf_path=Qwen/Qwen3-4B \
args.data_paths='[0.5, /lustre/data/wiki, 0.5, /lustre/data/c4]' \
args.train_iters=2000
Submit a detached pipeline-validation run on a Slurm profile with mock data:
$ nemotron steps run optimize/modelopt/distill -c default -b slurm_optimize_modelopt_distill \
args.use_mock_data=true \
args.seq_length=512 \
args.train_iters=100
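When these commands are launched from automation, it can help to assemble the argument list programmatically. The following Python sketch uses only the flags shown on this page (-c, -r, -b, --dry-run) and treats -r and -b as alternatives, as the syntax block indicates. The function build_distill_command is hypothetical and is not part of the nemotron CLI; the sketch only prints the command and does not run it.

```python
import shlex
from typing import List, Optional


def build_distill_command(
    config: str,
    overrides: List[str],
    run_profile: Optional[str] = None,
    batch_profile: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble an argv list for the distill step from documented flags."""
    if run_profile and batch_profile:
        # The syntax lists -r and -b as alternatives.
        raise ValueError("-r and -b are alternatives; pass at most one")
    cmd = ["nemotron", "steps", "run", "optimize/modelopt/distill", "-c", config]
    if run_profile:
        cmd += ["-r", run_profile]
    if batch_profile:
        cmd += ["-b", batch_profile]
    if dry_run:
        cmd.append("--dry-run")
    return cmd + list(overrides)


cmd = build_distill_command(
    "default",
    ["args.use_mock_data=true", "args.seq_length=512", "args.train_iters=100"],
    batch_profile="slurm_optimize_modelopt_distill",
)
# shlex.join quotes any token that needs it, so the output is safe to paste.
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run avoids shell quoting issues entirely, which matters for overrides such as args.data_paths that contain spaces and brackets.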