Model Optimization#

This directory contains comprehensive documentation for optimizing models with Megatron Bridge using NVIDIA ModelOpt. Learn how to apply various optimization techniques to improve inference efficiency while maintaining model quality.

Overview#

NVIDIA ModelOpt provides a suite of model optimization techniques for improving inference performance:

  • Quantization - Convert models from high-precision (FP32/BF16) to lower-precision formats (FP8, INT8, INT4) for efficient deployment (see the sketch after this list)

  • Distillation - Transfer knowledge from a pre-trained teacher model to a smaller, faster student model

  • Pruning - Reduce model size by removing layers (depth) or reducing dimensions (width) such as attention heads and hidden sizes

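For orientation, the snippet below is a minimal, generic sketch of post-training quantization with the ModelOpt API; the `model` and `calib_dataloader` names are placeholders rather than Megatron Bridge objects, and the complete Megatron Bridge workflows are covered in the sections linked below. Distillation and pruning follow analogous patterns through `modelopt.torch.distill` and `modelopt.torch.prune`.

```python
# Hypothetical PTQ sketch using the ModelOpt quantization API.
# `model` and `calib_dataloader` are placeholders for your own objects.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Feed a small calibration set through the model so ModelOpt can
    # collect the activation statistics used to set quantizer ranges.
    for batch in calib_dataloader:
        model(batch)

# Insert quantizers and calibrate; FP8_DEFAULT_CFG is one of the built-in
# configs (INT8 and INT4 variants are also available).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```
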
Quick Navigation#

I want to...#

🔧 Quantize a pretrained model → See the Post-Training Quantization section for complete PTQ workflows (quantize, resume and generate, export)

🏋️ Train with quantization → Check the Quantization-Aware Training section for QAT workflows
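As a rough sketch of the QAT flow, ModelOpt applies the same quantize-and-calibrate step as PTQ and then continues training the quantized model so the weights adapt to quantization noise. The example below assumes a plain PyTorch training loop; `optimizer`, `loss_fn`, `train_dataloader`, and `forward_loop` are placeholder names, not part of the Megatron Bridge API.

```python
# Hypothetical QAT sketch: quantize first, then keep training.
# All data/optimizer objects below are placeholders.
import modelopt.torch.quantization as mtq

# Step 1: insert quantizers and calibrate, exactly as in PTQ.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Step 2: fine-tune the quantized model with a standard training loop.
for batch, labels in train_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels)
    loss.backward()
    optimizer.step()
```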

References#