nemo_automodel.components.utils.flops_utils

Module Contents

Functions

Name	Description
`_build_moe_layer_pattern`	Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.
`_gdn_attention_per_layer_flops`	FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.
`_hybrid_model_flops`	Model FLOPs for a hybrid model.
`_mamba_layer_flops`	Model FLOPs for Mamba layer.
`_mla_attention_per_layer_flops`	Per-layer FLOPs for Multi-Latent Attention (MLA).
`_mla_moe_model_flops`	FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).
`_nemotronh_mlp_layer_flops`	Model FLOPs for an MLP layer. Assumes a gated linear unit.
`_nemotronh_moe_layer_flops`	Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).
`_nemotronh_mtp_flops`	Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra.
`_non_mla_attn_layer_flops`	Model FLOPs for attention layer
`attention_flops_calculator`	Calculate the flops for the attention part.
`bert_flops`	Model FLOPs for BERT family - accepts either AutoConfig or normalized config
`calculate_mfu`	Calculate Model FLOPs Utilization (MFU).
`clip_vit_l_flops`	Model FLOPs for CLIP ViT
`deepseekv3_flops`	Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config
`flux_flops`	Model FLOPs for FLUX
`get_flops_formula_for_hf_config`	Get the appropriate FLOPs formula function for a given HuggingFace config.
`glm4_moe_flops`	Estimate FLOPs for GLM4 MoE model configurations.
`gpt3_flops`	Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config
`gpt_oss_flops`	Model FLOPs for GPT-OSS
`gpt_oss_flops_calculator`	Calculate the flops for the GPT-OSS model
`llama2_flops`	Model FLOPs for llama2 family - accepts either AutoConfig or normalized config
`llama3_flops`	Model FLOPs for llama3 family - accepts either AutoConfig or normalized config
`loss_flops_calculator`	Calculate the flops for the loss
`minimax_m2_flops`	Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.
`mixtral_flops`	Model FLOPs for mixtral family - accepts either AutoConfig or normalized config
`mla_moe_flops`	Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).
`moe_mlp_flops_calculator`	Calculate the flops for the MLP
`nemotron_flops`	Model FLOPs for nemotron family - accepts either AutoConfig or normalized config
`nemotronh_flops`	Model FLOPs for NemotronH
`neva_projection_flops`	Model FLOPs for NeVA Projection
`qwen3_5_flops`	Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.
`qwen3_flops`	Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config
`step3_5_flash_flops`	Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).
`transformer_flops`	Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config.

API

nemo_automodel.components.utils.flops_utils._build_moe_layer_pattern(
    config,
    layers
)

Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.

Handles multiple config styles: first_k_dense_replace + moe_layer_freq, mlp_layer_types list, etc.
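As a rough illustration of the first config style, the sketch below builds a 0/1 pattern from first_k_dense_replace and moe_layer_freq. The helper name sketch_moe_pattern is hypothetical, and the rule it applies is an assumption borrowed from the convention in DeepSeek-style Hugging Face configs: a layer is MoE when its index is at least first_k_dense_replace and divisible by moe_layer_freq. It is not this module's implementation.

```python
def sketch_moe_pattern(num_layers, first_k_dense_replace=0, moe_layer_freq=1):
    """Hypothetical helper: 0 = dense layer, 1 = MoE layer."""
    pattern = []
    for layer_idx in range(num_layers):
        # Assumed rule: the first k layers stay dense, then every
        # moe_layer_freq-th layer index is MoE.
        is_moe = layer_idx >= first_k_dense_replace and layer_idx % moe_layer_freq == 0
        pattern.append(1 if is_moe else 0)
    return pattern


pattern = sketch_moe_pattern(num_layers=6, first_k_dense_replace=3, moe_layer_freq=1)
print(pattern)  # [0, 0, 0, 1, 1, 1]
```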

nemo_automodel.components.utils.flops_utils._gdn_attention_per_layer_flops(
    gbs,
    seq_len,
    hidden_size,
    linear_key_head_dim,
    linear_value_head_dim,
    linear_num_key_heads,
    linear_num_value_heads,
    linear_conv_kernel_dim
)

FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.

Based on the GDN FLOPs calculator from Megatron-Bridge PR #2925.

nemo_automodel.components.utils.flops_utils._hybrid_model_flops(
    config,
    gbs,
    seq_len
)

Model FLOPs for a hybrid model.

nemo_automodel.components.utils.flops_utils._mamba_layer_flops(
    config,
    gbs,
    seq_len
)

Model FLOPs for Mamba layer.

FLOPs are multiplied by 6 (3x for forward plus backward, times 2x for FMA) for in_proj/out_proj, which are standard GEMMs, and by 7 * 3 = 21 for the scan, which is a non-GEMM kernel with a higher op count per element.
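The GEMM multiplier can be illustrated with a small sketch. The helper gemm_train_flops is hypothetical and assumes a single dense projection of shape in_features x out_features applied to gbs * seq_len tokens. It does not reproduce the scan term or the actual Mamba layer formula.

```python
def gemm_train_flops(gbs, seq_len, in_features, out_features):
    """Hypothetical helper: training FLOPs for one dense projection."""
    fma_factor = 2      # one multiply and one add per weight per token
    fwd_bwd_factor = 3  # forward GEMM plus two backward GEMMs (input grad, weight grad)
    tokens = gbs * seq_len
    return fwd_bwd_factor * fma_factor * tokens * in_features * out_features


flops = gemm_train_flops(gbs=1, seq_len=4096, in_features=4096, out_features=8192)
print(flops == 6 * 4096 * 4096 * 8192)  # True
```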

nemo_automodel.components.utils.flops_utils._mla_attention_per_layer_flops(
    gbs,
    seq_len,
    hs,
    attention_heads,
    q_lora_rank,
    kv_lora_rank,
    qk_rope_head_dim,
    qk_nope_head_dim,
    v_head_dim,
    index_topk = None,
    index_n_heads = 0,
    index_head_dim = 0
)

Per-layer FLOPs for Multi-Latent Attention (MLA).

Shared by DeepSeek V3, Kimi K2.5, Mistral Small 4, GLM-5, etc.

When index_topk is set (DSA / sparse attention), accounts for:

Sparse main attention BMM: S * index_topk instead of 0.5 * S^2
DSA indexer overhead: Q/K/weights projections + full S^2 indexer BMM
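To see the effect of the sparse term, the sketch below compares only the two counts named above, S * index_topk against 0.5 * S^2. The helper attention_pair_count is hypothetical. Head counts, head dimensions, and the indexer overhead are deliberately left out, so this is not the per-layer formula itself.

```python
def attention_pair_count(seq_len, index_topk=None):
    """Hypothetical helper: query-key pairs scored by the main attention BMM."""
    if index_topk is None:
        # Dense causal attention: roughly half of the S x S score matrix.
        return 0.5 * seq_len * seq_len
    # DSA: each query attends to index_topk selected positions.
    return seq_len * index_topk


dense = attention_pair_count(seq_len=131072)
sparse = attention_pair_count(seq_len=131072, index_topk=2048)
print(dense / sparse)  # 32.0
```

The indexer still performs a full S^2 BMM of its own, so the net saving for a layer is smaller than this ratio suggests.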

nemo_automodel.components.utils.flops_utils._mla_moe_model_flops(
    gbs,
    seq_len,
    hs,
    layers,
    attention_heads,
    vocab_size,
    q_lora_rank,
    kv_lora_rank,
    qk_rope_head_dim,
    qk_nope_head_dim,
    v_head_dim,
    dense_ffn_hs,
    moe_ffn_hs,
    moe_router_topk,
    moe_shared_expert_hs,
    moe_layer_pattern,
    mtp_num_layers = 0,
    index_topk = None,
    index_n_heads = 0,
    index_head_dim = 0
)

FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).

Parameters:

moe_layer_pattern

List of 0/1 per layer (0=dense, 1=MoE).

moe_shared_expert_hs

Total intermediate size for all shared experts combined.

index_topk

Defaults to None

If set, use DSA sparse attention with this many selected positions.

index_n_heads

Defaults to 0

Number of heads in the DSA indexer.

index_head_dim

Defaults to 0

Head dimension of the DSA indexer.

nemo_automodel.components.utils.flops_utils._nemotronh_mlp_layer_flops(
    config,
    gbs,
    seq_len
)

Model FLOPs for an MLP layer. Assumes a gated linear unit.

nemo_automodel.components.utils.flops_utils._nemotronh_moe_layer_flops(
    config,
    gbs,
    seq_len
)

Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).

Nemotron V3 uses relu2 (non-gated) for both routed and shared experts, so each expert has 2 linear projections (up_proj + down_proj), not 3.

When moe_latent_size is set (Super V3), routed experts operate in a reduced latent space with additional projection layers (fc1_latent_proj, fc2_latent_proj). The shared expert and gate always operate in the full hidden_size dimension.
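The difference between two and three projections per expert can be sketched as follows. The helper expert_mlp_train_flops is hypothetical and covers only the routed experts in the full hidden dimension. The latent projections, the shared expert, and the router gate are omitted, and the sizes are illustrative.

```python
def expert_mlp_train_flops(tokens, hidden_size, ffn_size, topk, gated):
    """Hypothetical helper: training FLOPs for the routed experts of one MoE layer."""
    # Gated units use gate + up + down; non-gated (e.g. relu2) use up + down.
    num_projections = 3 if gated else 2
    # 6 = 3 (forward + backward) * 2 (multiply-add)
    return 6 * tokens * topk * num_projections * hidden_size * ffn_size


relu2 = expert_mlp_train_flops(tokens=8192, hidden_size=4096, ffn_size=2048, topk=6, gated=False)
gated = expert_mlp_train_flops(tokens=8192, hidden_size=4096, ffn_size=2048, topk=6, gated=True)
print(gated / relu2)  # 1.5
```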

nemo_automodel.components.utils.flops_utils._nemotronh_mtp_flops(
    config,
    gbs,
    seq_len,
    num_mtp_layers,
    mtp_block_types,
    use_repeated_layer
)

Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra.

The head predicts num_mtp_layers (N) additional tokens. Each of the N depths runs the MTP block pattern once, plus a depth-fusion projection (eh_proj: cat[enorm(embed), hnorm(hidden)] of size 2*hidden -> hidden) and a vocab projection (weight-tied lm_head).

Repeated layer: when use_repeated_layer is True, the model builds a single physical depth (mtp_block_types lists its sublayers) and reuses it across all N depths. The block still executes once per depth, so its compute is N times that of the physical sublayers; this N-fold factor accounts for the repeated layer's FLOPs. When use_repeated_layer is False, mtp_block_types already spans all N physical depths, so the block pattern runs once.

These settings are NOT fully recoverable from the HF config (it retains only the physical depth count and omits the block pattern), so callers pass the effective values read from the built model: num_mtp_layers = model.mtp_config.num_layers and mtp_block_types = [s.block_type for s in model.mtp.layers].

Parameters:

num_mtp_layers

Number of effective MTP depths actually run (model.mtp_config.num_layers).

mtp_block_types

Block types ("mamba"/"attention"/"mlp"/"moe") of the physical MTP sublayers (model.mtp.layers).

use_repeated_layer

True if the physical depth is reused across the N depths.
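The repeated-layer rule described above can be sketched by counting how often each block type executes. The helper mtp_block_executions is hypothetical. It only counts executions and does not compute FLOPs, the eh_proj term, or the vocab projection.

```python
def mtp_block_executions(num_mtp_layers, mtp_block_types, use_repeated_layer):
    """Hypothetical helper: executions per block type in the MTP head."""
    # A shared physical depth runs once per MTP depth; an unrolled
    # pattern already lists every depth, so it runs once.
    repeats = num_mtp_layers if use_repeated_layer else 1
    counts = {}
    for block_type in mtp_block_types:
        counts[block_type] = counts.get(block_type, 0) + repeats
    return counts


shared = mtp_block_executions(2, ["attention", "moe"], use_repeated_layer=True)
unrolled = mtp_block_executions(
    2, ["attention", "moe", "attention", "moe"], use_repeated_layer=False
)
print(shared == unrolled)  # True
```

Both configurations execute the same number of blocks, which is why the FLOPs estimate has to scale the shared depth by N.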

nemo_automodel.components.utils.flops_utils._non_mla_attn_layer_flops(
    config,
    gbs,
    seq_len
)

Model FLOPs for attention layer

nemo_automodel.components.utils.flops_utils.attention_flops_calculator(
    seqlen,
    hidden_size,
    num_attention_heads,
    num_query_groups,
    kv_channels: typing.Optional[int] = None,
    is_swa: bool = False,
    swa_window_size: int = 128
)

Calculate the flops for the attention part.

nemo_automodel.components.utils.flops_utils.bert_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for BERT family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.calculate_mfu(
    tflops,
    world_size,
    time_seconds,
    reference_mfu = 1979.0
)

Calculate Model FLOPs Utilization (MFU).

Parameters:

tflops

TFLOPs per GPU

world_size

Total number of GPUs

time_seconds

Time taken for the computation, in seconds

reference_mfu

Defaults to 1979.0

Peak TFLOPs of the reference hardware (the default, 1979.0, corresponds to H100)

Returns:

MFU as a percentage
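For orientation, the general definition of MFU is achieved per-GPU throughput divided by peak per-GPU throughput. The sketch below implements that definition in a hypothetical helper, mfu_percent, with its own argument convention (total work in TFLOP across all GPUs). How calculate_mfu itself combines its arguments is not specified here, so treat this as an illustration of the metric rather than of the function.

```python
def mfu_percent(total_tflop, world_size, time_seconds, peak_tflops_per_gpu=1979.0):
    """Hypothetical helper: MFU as a percentage of peak throughput."""
    # Achieved throughput per GPU, in TFLOP/s.
    achieved_per_gpu = total_tflop / (world_size * time_seconds)
    return 100.0 * achieved_per_gpu / peak_tflops_per_gpu


# 31664 TFLOP of model work on 8 GPUs in 4 s is 989.5 TFLOP/s per GPU.
print(mfu_percent(total_tflop=31664.0, world_size=8, time_seconds=4.0))  # 50.0
```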

nemo_automodel.components.utils.flops_utils.clip_vit_l_flops(
    config
)

Model FLOPs for CLIP ViT

nemo_automodel.components.utils.flops_utils.deepseekv3_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.flux_flops(
    config
)

Model FLOPs for FLUX

nemo_automodel.components.utils.flops_utils.get_flops_formula_for_hf_config(
    config: typing.Any
) -> typing.Optional[typing.Callable]

Get the appropriate FLOPs formula function for a given HuggingFace config.

Parameters:

config

Any

HuggingFace model config object

Returns: Optional[Callable]

The appropriate FLOPs formula function, or None if model type is not supported

nemo_automodel.components.utils.flops_utils.glm4_moe_flops(
    config,
    gbs = 1,
    seq_len = None
)

Estimate FLOPs for GLM4 MoE model configurations.

nemo_automodel.components.utils.flops_utils.gpt3_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.gpt_oss_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for GPT-OSS

nemo_automodel.components.utils.flops_utils.gpt_oss_flops_calculator(
    gbs,
    num_layers,
    seqlen,
    hidden_size,
    num_attention_heads,
    num_query_groups,
    moe_ffn_hidden_size,
    moe_router_topk,
    vocab_size,
    kv_channels: typing.Optional[int] = None,
    swa_window_size: int = 128,
    window_attn_skip_freq: typing.Optional[int] = 2
)

Calculate the flops for the GPT-OSS model

nemo_automodel.components.utils.flops_utils.llama2_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for llama2 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.llama3_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for llama3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.loss_flops_calculator(
    seqlen,
    hidden_size,
    vocab_size
)

Calculate the flops for the loss

nemo_automodel.components.utils.flops_utils.minimax_m2_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.

Architecture: GQA attention (Q/K/V/O separate projections, head_dim may differ from hidden_size // num_heads) + MoE with SwiGLU (no shared experts by default). Optionally includes MTP (Multi-Token Prediction) modules gated by use_mtp.
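The sketch below shows why an explicit head_dim matters for the attention projections. The helper gqa_projection_weights is hypothetical and counts only the weights of the separate Q/K/V/O projections. The sizes are illustrative and not read from a real checkpoint.

```python
def gqa_projection_weights(hidden_size, num_heads, num_kv_heads, head_dim):
    """Hypothetical helper: weight counts of separate Q/K/V/O projections."""
    q = hidden_size * num_heads * head_dim
    k = hidden_size * num_kv_heads * head_dim
    v = hidden_size * num_kv_heads * head_dim
    o = num_heads * head_dim * hidden_size
    return q + k + v + o


explicit = gqa_projection_weights(hidden_size=3072, num_heads=48, num_kv_heads=8, head_dim=128)
derived = gqa_projection_weights(
    hidden_size=3072, num_heads=48, num_kv_heads=8, head_dim=3072 // 48
)
print(explicit / derived)  # 2.0
```

Deriving head_dim as hidden_size // num_heads would undercount the projection work by half in this example, since every term scales linearly with head_dim.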

nemo_automodel.components.utils.flops_utils.mixtral_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for mixtral family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.mla_moe_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).

Handles VL wrappers by extracting text_config if present.

nemo_automodel.components.utils.flops_utils.moe_mlp_flops_calculator(
    seqlen,
    hidden_size,
    moe_ffn_hidden_size,
    moe_router_topk,
    gated_linear_unit: bool = True
)

Calculate the flops for the MLP

nemo_automodel.components.utils.flops_utils.nemotron_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for nemotron family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.nemotronh_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for NemotronH

nemo_automodel.components.utils.flops_utils.neva_projection_flops(
    config
)

Model FLOPs for NeVA Projection

nemo_automodel.components.utils.flops_utils.qwen3_5_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.

Qwen3.5 uses a hybrid attention pattern: 75% GDN (linear attention) layers and 25% standard GQA (full attention) layers (full_attention_interval=4). Supports both the MoE variant (Qwen3.5-35B-A3B) and Dense variant (Qwen3.5-27B).
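The 75% / 25% split follows directly from full_attention_interval=4. The sketch below uses a hypothetical helper, hybrid_layer_types, and assumes that the full-attention layer is the last one in each group of four. The exact position within the group is an assumption, and the layer count is illustrative.

```python
def hybrid_layer_types(num_layers, full_attention_interval=4):
    """Hypothetical helper: label each layer as GDN or full attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = hybrid_layer_types(num_layers=48)
print(types.count("linear_attention"), types.count("full_attention"))  # 36 12
```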

nemo_automodel.components.utils.flops_utils.qwen3_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.step3_5_flash_flops(
    config,
    gbs = 1,
    seq_len = None
)

Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).

Architecture: hybrid full/SWA attention with different head counts per type, MoE with shared expert on most layers, first few layers dense, SwiGLU.

nemo_automodel.components.utils.flops_utils.transformer_flops(
    config,
    gbs = 1,
    seq_len = None
)

Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. Note: This does not cover encoder-decoder models.