Model Optimization#

This directory contains comprehensive documentation for optimizing models with Megatron Bridge using NVIDIA ModelOpt. Learn how to apply various optimization techniques to improve inference efficiency while maintaining model quality.

Overview#

NVIDIA ModelOpt provides a suite of model optimization techniques for improving inference performance:

  • Quantization - Convert models from high-precision (FP32/BF16) to lower-precision formats (FP8, INT8, INT4) for efficient deployment (see the sketch after this list)

  • Distillation - Transfer knowledge from a pre-trained teacher model to a smaller, faster student model

  • Pruning - Reduce model size by removing layers (depth) or reducing dimensions (width) such as attention heads and hidden sizes

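For orientation, the snippet below is a minimal, generic sketch of post-training quantization with the ModelOpt API; the `model` and `calib_dataloader` names are placeholders rather than Megatron Bridge objects, and the complete Megatron Bridge workflows are covered in the sections linked below. Distillation and pruning follow analogous patterns through `modelopt.torch.distill` and `modelopt.torch.prune`.

```python
# Hypothetical PTQ sketch using the ModelOpt quantization API.
# `model` and `calib_dataloader` are placeholders for your own objects.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Feed a small calibration set through the model so ModelOpt can
    # collect the activation statistics used to set quantizer ranges.
    for batch in calib_dataloader:
        model(batch)

# Insert quantizers and calibrate; FP8_DEFAULT_CFG is one of the built-in
# configs (INT8 and INT4 variants are also available).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```
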
Quick Navigation#

I want to...#

🔧 Quantize a pretrained model → See the Post-Training Quantization section for complete PTQ workflows (quantize, resume and generate, export)

🏋️ Train with quantization → Check the Quantization-Aware Training section for QAT workflows
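As a rough sketch of the QAT flow, ModelOpt applies the same quantize-and-calibrate step as PTQ and then continues training the quantized model so the weights adapt to quantization noise. The example below assumes a plain PyTorch training loop; `optimizer`, `loss_fn`, `train_dataloader`, and `forward_loop` are placeholder names, not part of the Megatron Bridge API.

```python
# Hypothetical QAT sketch: quantize first, then keep training.
# All data/optimizer objects below are placeholders.
import modelopt.torch.quantization as mtq

# Step 1: insert quantizers and calibrate, exactly as in PTQ.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Step 2: fine-tune the quantized model with a standard training loop.
for batch, labels in train_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels)
    loss.backward()
    optimizer.step()
```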

References#