nemo_automodel.components.quantization.fp8#

Module Contents#

Classes#

FP8Config

Configuration for FP8 quantization settings.

Functions#

_has_cuda_capability

Check if CUDA device has required compute capability.

_module_filter_fn

Filter function to exclude certain modules from FP8 conversion.

apply_fp8_to_model

Apply FP8 quantization to a PyTorch model using torchao.

verify_fp8_conversion

Verify that FP8 conversion was successful by counting converted modules.

Data#

logger

API#

nemo_automodel.components.quantization.fp8.logger#

‘getLogger(…)’

class nemo_automodel.components.quantization.fp8.FP8Config[source]#

Configuration for FP8 quantization settings.

recipe_name: Optional[Literal['tensorwise', 'rowwise', 'rowwise_with_gw_hp']]#

None

FP8 recipe to use. If None, uses tensorwise scaling with manual configuration.

enable_fsdp_float8_all_gather: bool#

False

Whether to enable float8 all-gather in FSDP, recommended for tensorwise scaling.

precompute_float8_dynamic_scale_for_fsdp: bool#

False

Whether to precompute float8 scales dynamically for FSDP, recommended for tensorwise scaling.

force_recompute_fp8_weight_in_bwd: bool#

False

Whether to force recomputation of FP8 weights during the backward pass.

filter_fqns: List[str]#

‘field(…)’

List of fully qualified names of modules to skip when applying float8 training. nn.Linear modules with any dim size not divisible by 16 are always skipped due to hardware requirements. Example: ["attention.wq", "attention.wk", "attention.wv", "lm_head"]

emulate: bool#

False

If True, emulation is used instead of hardware-accelerated GEMM. This is intended for testing purposes only.

classmethod from_config_node(config_node)[source]#

Create FP8Config from a configuration node.

to_dict()[source]#

Convert the configuration to a plain dictionary.
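A minimal usage sketch, assuming FP8Config is a dataclass-style config constructed with keyword arguments (the field values below are illustrative, not recommendations):

```python
from nemo_automodel.components.quantization.fp8 import FP8Config

# Illustrative settings: rowwise scaling, skip the LM head, emulate FP8 on unsupported GPUs.
cfg = FP8Config(
    recipe_name="rowwise",
    filter_fqns=["lm_head"],
    emulate=True,
)

# Inspect the resulting settings as a plain dictionary.
print(cfg.to_dict())
```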
nemo_automodel.components.quantization.fp8._has_cuda_capability(major: int, minor: int) → bool[source]#

Check if CUDA device has required compute capability.
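The implementation is not reproduced here; a minimal sketch of such a check, assuming it is based on torch.cuda.get_device_capability, might look like:

```python
import torch

def has_cuda_capability(major: int, minor: int) -> bool:
    # No CUDA device means no FP8-capable hardware.
    if not torch.cuda.is_available():
        return False
    # Compare the current device's compute capability against the requirement.
    device_major, device_minor = torch.cuda.get_device_capability()
    return (device_major, device_minor) >= (major, minor)
```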

nemo_automodel.components.quantization.fp8._module_filter_fn(module, name, filter_fqns: List[str] = None)[source]#

Filter function to exclude certain modules from FP8 conversion.

Parameters:
  • module – The module to check

  • name – Fully qualified name of the module

  • filter_fqns – List of FQNs to filter out

Returns:

True if the module should be converted to FP8, False otherwise
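A hedged sketch of such a filter, assuming it checks the FQN list and the divisible-by-16 requirement noted under FP8Config.filter_fqns (the rules in the real implementation may differ):

```python
from typing import List, Optional
import torch.nn as nn

def module_filter_fn(module: nn.Module, name: str, filter_fqns: Optional[List[str]] = None) -> bool:
    # Skip modules whose fully qualified name matches one of the filtered FQNs.
    if filter_fqns and any(fqn in name for fqn in filter_fqns):
        return False
    # Only nn.Linear modules are candidates for FP8 conversion.
    if not isinstance(module, nn.Linear):
        return False
    # FP8 GEMM kernels require dimensions divisible by 16.
    if module.in_features % 16 != 0 or module.out_features % 16 != 0:
        return False
    return True
```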

nemo_automodel.components.quantization.fp8.apply_fp8_to_model(
model: torch.nn.Module,
filter_fqns: Optional[List[str]] = None,
recipe_name: Optional[str] = None,
force_recompute_fp8_weight_in_bwd: bool = False,
enable_fsdp_float8_all_gather: bool = False,
emulate: bool = False,
) → torch.nn.Module[source]#

Apply FP8 quantization to a PyTorch model using torchao.

Parameters:
  • model – The model to convert

  • filter_fqns – List of module names to exclude from FP8 conversion

  • recipe_name – Recipe name for FP8 configuration (“tensorwise”, “rowwise”, etc.)

  • force_recompute_fp8_weight_in_bwd – Whether to force recomputation of the FP8 weight in the backward pass

  • enable_fsdp_float8_all_gather – Whether to enable FSDP FP8 all-gather

  • emulate – Use emulation instead of hardware acceleration (for testing on older GPUs)

Returns:

The model with FP8 linear layers (modified in-place)

Raises:
  • ImportError – If torchao is not installed

  • ValueError – If hardware doesn’t support FP8 and emulation is disabled
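A minimal usage sketch; the toy model and argument values are illustrative only (emulate=True lets the example run on GPUs without native FP8 support):

```python
import torch.nn as nn
from nemo_automodel.components.quantization.fp8 import apply_fp8_to_model

# Toy model with FP8-friendly dimensions (multiples of 16).
model = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
)

# Convert eligible linear layers to FP8 using the tensorwise recipe.
model = apply_fp8_to_model(model, recipe_name="tensorwise", emulate=True)
```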

nemo_automodel.components.quantization.fp8.verify_fp8_conversion(model: torch.nn.Module) → dict[source]#

Verify that FP8 conversion was successful by counting converted modules.

Parameters:

model – The model to verify

Returns:

Dict with conversion statistics
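A usage sketch continuing the example above; the exact keys of the returned statistics dictionary are not documented here, so the print is purely illustrative:

```python
from nemo_automodel.components.quantization.fp8 import verify_fp8_conversion

# Count how many modules were converted to FP8 in the model from the previous example.
stats = verify_fp8_conversion(model)
print(stats)
```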