Model Optimization#
This directory contains comprehensive documentation for optimizing models with Megatron Bridge using NVIDIA ModelOpt. Learn how to apply various optimization techniques to improve inference efficiency while maintaining model quality.
Overview#
NVIDIA ModelOpt provides a suite of model optimization techniques for improving inference performance:
Quantization - Convert models from high-precision (FP32/BF16) to lower-precision formats (FP8, INT8, INT4) for efficient deployment
Distillation - Transfer knowledge from a pre-trained teacher model to a smaller, faster student model
Pruning - Reduce model size by removing layers (depth) or reducing dimensions (width) such as attention heads and hidden sizes