Run Post-Training Optimization#
Optimization steps apply NVIDIA Model Optimizer flows after you have a trained model that is worth compressing. Run them only on checkpoints that already passed your quality bar.
Steps#
| Step id | Purpose | Typical input | Output |
|---|---|---|---|
| optimize/modelopt/quantize | Post-training quantization | | |
| optimize/modelopt/prune | Structured pruning | | |
| optimize/modelopt/distill | Teacher-student recovery | Teacher and student | |
Decision Flow#
If you only need a smaller numeric type for inference, start with optimize/modelopt/quantize. Pick a quantization recipe that matches your hardware: FP8 suits Hopper, and NVFP4 suits Blackwell. Read step.toml for parameter names and allowed values.
If you need a smaller architecture, use optimize/modelopt/prune.
If quality drops after compression, use optimize/modelopt/distill with a full-precision teacher and a student that matches the compressed artifact.
Do not run optimization on unmerged adapters. When the source was parameter-efficient fine tuning (PEFT) with low-rank adaptation (LoRA), merge LoRA into a base checkpoint first.
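The decision flow above can be sketched as a small lookup. This is only an illustration: the step ids come from this page, while `Goal`, `choose_steps`, and the `FORMAT_BY_GPU` keys are hypothetical names that are not part of the nemotron CLI or of step.toml. The returned list mirrors the order of the rules above and is not an execution order.

```python
from dataclasses import dataclass

# Step ids as named on this page.
QUANTIZE = "optimize/modelopt/quantize"
PRUNE = "optimize/modelopt/prune"
DISTILL = "optimize/modelopt/distill"

# Hardware-to-format pairing stated in the decision flow. The keys are
# hypothetical labels, not values accepted by step.toml.
FORMAT_BY_GPU = {"hopper": "FP8", "blackwell": "NVFP4"}


@dataclass
class Goal:
    """Hypothetical description of what you want from optimization."""
    smaller_dtype: bool = False         # only need a smaller numeric type
    smaller_architecture: bool = False  # need a smaller architecture
    quality_dropped: bool = False       # quality regressed after compression
    unmerged_lora: bool = False         # checkpoint is still base + adapter


def choose_steps(goal: Goal) -> list[str]:
    """Return the step ids the decision flow suggests, in rule order."""
    if goal.unmerged_lora:
        # The flow forbids optimizing unmerged adapters.
        raise ValueError("Merge LoRA into a base checkpoint first.")
    steps = []
    if goal.smaller_dtype:
        steps.append(QUANTIZE)
    if goal.smaller_architecture:
        steps.append(PRUNE)
    if goal.quality_dropped:
        steps.append(DISTILL)
    return steps
```

For example, `choose_steps(Goal(smaller_architecture=True, quality_dropped=True))` suggests pruning and distillation, and `FORMAT_BY_GPU["blackwell"]` gives the format to look for when you pick a recipe.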
Order of Operations#
When both pruning and distillation apply, prune first and then distill to recover quality. Quantization is largely independent of that sequence, but you still need to benchmark before and after you run it.
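A minimal sketch of that ordering rule, assuming you keep your planned steps in a plain list. `plan_is_ordered` is a hypothetical helper, not part of any NVIDIA tooling. It checks only the one constraint stated here and leaves quantization free to appear anywhere.

```python
PRUNE = "optimize/modelopt/prune"
DISTILL = "optimize/modelopt/distill"
QUANTIZE = "optimize/modelopt/quantize"


def plan_is_ordered(plan: list[str]) -> bool:
    """Return False only when distillation is scheduled before pruning."""
    if PRUNE in plan and DISTILL in plan:
        return plan.index(PRUNE) < plan.index(DISTILL)
    # With at most one of the two steps present there is nothing to order.
    return True
```

Calling `plan_is_ordered([DISTILL, PRUNE])` returns `False`, which is a cue to swap the two steps before launching anything.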
Sample Commands#
$ uv run nemotron steps run optimize/modelopt/quantize -c tiny
$ uv run nemotron steps run optimize/modelopt/prune -c tiny
$ uv run nemotron steps run optimize/modelopt/distill -c tiny
Tiny runs and mock-data runs validate end-to-end execution only. Judge final quality on full calibration or distillation data.
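If you script these smoke runs, the commands above can be assembled programmatically. The sketch below builds only the argument list shown in the sample commands. `build_command` and `STEP_IDS` are hypothetical names, and the sketch assumes no flags beyond `-c`.

```python
import shlex

# Step ids taken from the sample commands on this page.
STEP_IDS = (
    "optimize/modelopt/quantize",
    "optimize/modelopt/prune",
    "optimize/modelopt/distill",
)


def build_command(step_id: str, config: str = "tiny") -> list[str]:
    """Build the argv for one step run, mirroring the sample commands."""
    if step_id not in STEP_IDS:
        raise ValueError(f"Unknown step id: {step_id}")
    return ["uv", "run", "nemotron", "steps", "run", step_id, "-c", config]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for step in STEP_IDS:
        print(shlex.join(build_command(step)))
```

To execute a command rather than print it, pass the returned list to `subprocess.run(cmd, check=True)` from the standard library.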
Success Criteria#
You keep the original high-precision checkpoint and its evaluation baseline.
Post-optimization evaluation uses the same benchmark suite as pre-optimization evaluation.
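One way to enforce both criteria is to compare score dictionaries and refuse to proceed when the benchmark suites differ. This is a sketch under assumptions: `compare_runs` is a hypothetical helper, and it assumes each evaluation yields a mapping from benchmark name to a numeric score.

```python
def compare_runs(
    baseline: dict[str, float], optimized: dict[str, float]
) -> dict[str, float]:
    """Return per-benchmark deltas (optimized minus baseline).

    Raises ValueError when the two runs did not use the same benchmarks.
    """
    if baseline.keys() != optimized.keys():
        # Symmetric difference: benchmarks present in only one of the runs.
        mismatched = sorted(baseline.keys() ^ optimized.keys())
        raise ValueError(f"Benchmark suites differ: {mismatched}")
    return {name: optimized[name] - baseline[name] for name in baseline}
```

A negative delta means the optimized checkpoint scored lower than the original on that benchmark. How large a drop is acceptable depends on your own quality bar.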