# nemo_automodel.components.utils.flops_utils

## Module Contents

### Functions

| Name                                                                                                              | Description                                                                                        |
| ----------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| [`_build_moe_layer_pattern`](#nemo_automodel-components-utils-flops_utils-_build_moe_layer_pattern)               | Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.                                       |
| [`_gdn_attention_per_layer_flops`](#nemo_automodel-components-utils-flops_utils-_gdn_attention_per_layer_flops)   | FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.                                  |
| [`_hybrid_model_flops`](#nemo_automodel-components-utils-flops_utils-_hybrid_model_flops)                         | Model FLOPs for hybrid model                                                                       |
| [`_mamba_layer_flops`](#nemo_automodel-components-utils-flops_utils-_mamba_layer_flops)                           | Model FLOPs for Mamba layer.                                                                       |
| [`_mla_attention_per_layer_flops`](#nemo_automodel-components-utils-flops_utils-_mla_attention_per_layer_flops)   | Per-layer FLOPs for Multi-Latent Attention (MLA).                                                  |
| [`_mla_moe_model_flops`](#nemo_automodel-components-utils-flops_utils-_mla_moe_model_flops)                       | FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).                                        |
| [`_nemotronh_mlp_layer_flops`](#nemo_automodel-components-utils-flops_utils-_nemotronh_mlp_layer_flops)           | Model FLOPs for MLP layer. Assume gated linear unit.                                               |
| [`_nemotronh_moe_layer_flops`](#nemo_automodel-components-utils-flops_utils-_nemotronh_moe_layer_flops)           | Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).                  |
| [`_nemotronh_mtp_flops`](#nemo_automodel-components-utils-flops_utils-_nemotronh_mtp_flops)                       | Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra.                   |
| [`_non_mla_attn_layer_flops`](#nemo_automodel-components-utils-flops_utils-_non_mla_attn_layer_flops)             | Model FLOPs for attention layer                                                                    |
| [`attention_flops_calculator`](#nemo_automodel-components-utils-flops_utils-attention_flops_calculator)           | Calculate the FLOPs for the attention part.                                                        |
| [`bert_flops`](#nemo_automodel-components-utils-flops_utils-bert_flops)                                           | Model FLOPs for BERT family - accepts either AutoConfig or normalized config                       |
| [`calculate_mfu`](#nemo_automodel-components-utils-flops_utils-calculate_mfu)                                     | Calculate Model FLOPs Utilization (MFU).                                                           |
| [`clip_vit_l_flops`](#nemo_automodel-components-utils-flops_utils-clip_vit_l_flops)                               | Model FLOPs for CLIP ViT                                                                           |
| [`deepseekv3_flops`](#nemo_automodel-components-utils-flops_utils-deepseekv3_flops)                               | Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config                       |
| [`flux_flops`](#nemo_automodel-components-utils-flops_utils-flux_flops)                                           | Model FLOPs for FLUX                                                                               |
| [`get_flops_formula_for_hf_config`](#nemo_automodel-components-utils-flops_utils-get_flops_formula_for_hf_config) | Get the appropriate FLOPs formula function for a given HuggingFace config.                         |
| [`glm4_moe_flops`](#nemo_automodel-components-utils-flops_utils-glm4_moe_flops)                                   | Estimate FLOPs for GLM4 MoE model configurations.                                                  |
| [`gpt3_flops`](#nemo_automodel-components-utils-flops_utils-gpt3_flops)                                           | Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config                       |
| [`gpt_oss_flops`](#nemo_automodel-components-utils-flops_utils-gpt_oss_flops)                                     | Model FLOPs for GPT-OSS                                                                            |
| [`gpt_oss_flops_calculator`](#nemo_automodel-components-utils-flops_utils-gpt_oss_flops_calculator)               | Calculate the FLOPs for the GPT-OSS model                                                          |
| [`llama2_flops`](#nemo_automodel-components-utils-flops_utils-llama2_flops)                                       | Model FLOPs for llama2 family - accepts either AutoConfig or normalized config                     |
| [`llama3_flops`](#nemo_automodel-components-utils-flops_utils-llama3_flops)                                       | Model FLOPs for llama3 family - accepts either AutoConfig or normalized config                     |
| [`loss_flops_calculator`](#nemo_automodel-components-utils-flops_utils-loss_flops_calculator)                     | Calculate the FLOPs for the loss                                                                   |
| [`minimax_m2_flops`](#nemo_automodel-components-utils-flops_utils-minimax_m2_flops)                               | Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.                |
| [`mixtral_flops`](#nemo_automodel-components-utils-flops_utils-mixtral_flops)                                     | Model FLOPs for mixtral family - accepts either AutoConfig or normalized config                    |
| [`mla_moe_flops`](#nemo_automodel-components-utils-flops_utils-mla_moe_flops)                                     | Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).                          |
| [`moe_mlp_flops_calculator`](#nemo_automodel-components-utils-flops_utils-moe_mlp_flops_calculator)               | Calculate the FLOPs for the MLP                                                                    |
| [`nemotron_flops`](#nemo_automodel-components-utils-flops_utils-nemotron_flops)                                   | Model FLOPs for nemotron family - accepts either AutoConfig or normalized config                   |
| [`nemotronh_flops`](#nemo_automodel-components-utils-flops_utils-nemotronh_flops)                                 | Model FLOPs for NemotronH                                                                          |
| [`neva_projection_flops`](#nemo_automodel-components-utils-flops_utils-neva_projection_flops)                     | Model FLOPs for NeVA Projection                                                                    |
| [`qwen3_5_flops`](#nemo_automodel-components-utils-flops_utils-qwen3_5_flops)                                     | Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.                     |
| [`qwen3_flops`](#nemo_automodel-components-utils-flops_utils-qwen3_flops)                                         | Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config                      |
| [`step3_5_flash_flops`](#nemo_automodel-components-utils-flops_utils-step3_5_flash_flops)                         | Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).                       |
| [`transformer_flops`](#nemo_automodel-components-utils-flops_utils-transformer_flops)                             | Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. |

### API

```python
nemo_automodel.components.utils.flops_utils._build_moe_layer_pattern(
    config,
    layers
)
```

Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.

Handles multiple config styles: first\_k\_dense\_replace + moe\_layer\_freq,
mlp\_layer\_types list, etc.
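
As a rough illustration of the first of these config styles, the sketch below builds such a 0/1 list from `first_k_dense_replace` and `moe_layer_freq`. The helper name `moe_layer_pattern` and the exact rule (the first k layers are dense, and afterwards a layer is MoE when its index is a multiple of `moe_layer_freq`) are assumptions for illustration, not the library's implementation.

```python
def moe_layer_pattern(num_layers, first_k_dense_replace=0, moe_layer_freq=1):
    """Return a list with 0 for dense layers and 1 for MoE layers.

    Hypothetical helper: the selection rule below is an assumption.
    """
    pattern = []
    for i in range(num_layers):
        # Assumed rule: the first k layers stay dense; after that, layer i
        # is MoE when its index is a multiple of moe_layer_freq.
        is_moe = i >= first_k_dense_replace and i % moe_layer_freq == 0
        pattern.append(1 if is_moe else 0)
    return pattern


print(moe_layer_pattern(6, first_k_dense_replace=2, moe_layer_freq=1))
# [0, 0, 1, 1, 1, 1]
```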

```python
nemo_automodel.components.utils.flops_utils._gdn_attention_per_layer_flops(
    gbs,
    seq_len,
    hidden_size,
    linear_key_head_dim,
    linear_value_head_dim,
    linear_num_key_heads,
    linear_num_value_heads,
    linear_conv_kernel_dim
)
```

FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.

Based on the GDN FLOPs calculator from Megatron-Bridge PR #2925.

```python
nemo_automodel.components.utils.flops_utils._hybrid_model_flops(
    config,
    gbs,
    seq_len
)
```

Model FLOPs for a hybrid model.

```python
nemo_automodel.components.utils.flops_utils._mamba_layer_flops(
    config,
    gbs,
    seq_len
)
```

Model FLOPs for Mamba layer.

The in\_proj/out\_proj terms (standard GEMMs) are multiplied by 6 (3x for fwd+bwd \* 2x for FMA),
and the scan term is multiplied by 7 \* 3 = 21 (non-GEMM kernel, higher op count per element).
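
The two multipliers can be spelled out as a small arithmetic sketch. The constant and function names are hypothetical, and reading the 7 as an op count per scan element and the 3 as the fwd+bwd factor is an interpretation of the note above rather than something taken from the implementation.

```python
FWD_BWD = 3  # forward + backward, relative to one forward pass
FMA = 2      # one fused multiply-add counts as 2 FLOPs

GEMM_MULTIPLIER = FWD_BWD * FMA  # 6, applied to in_proj / out_proj
SCAN_OPS_PER_ELEMENT = 7         # assumed meaning of the "7" in the note
SCAN_MULTIPLIER = SCAN_OPS_PER_ELEMENT * FWD_BWD  # 21, applied to the scan


def gemm_train_flops(tokens, in_features, out_features):
    """Training FLOPs for one dense projection over `tokens` tokens."""
    return GEMM_MULTIPLIER * tokens * in_features * out_features


print(GEMM_MULTIPLIER, SCAN_MULTIPLIER)  # 6 21
```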

```python
nemo_automodel.components.utils.flops_utils._mla_attention_per_layer_flops(
    gbs,
    seq_len,
    hs,
    attention_heads,
    q_lora_rank,
    kv_lora_rank,
    qk_rope_head_dim,
    qk_nope_head_dim,
    v_head_dim,
    index_topk = None,
    index_n_heads = 0,
    index_head_dim = 0
)
```

Per-layer FLOPs for Multi-Latent Attention (MLA).

Shared by DeepSeek V3, Kimi K2.5, Mistral Small 4, GLM-5, etc.

When index\_topk is set (DSA / sparse attention), accounts for:

* Sparse main attention BMM: S \* index\_topk instead of 0.5 \* S^2
* DSA indexer overhead: Q/K/weights projections + full S^2 indexer BMM
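
To make the first bullet concrete, the sketch below compares the number of query/key pairs scored by the main attention BMM with and without `index_topk`. The helper `attended_pairs` is hypothetical and covers only the main attention term; the indexer overhead from the second bullet is deliberately left out.

```python
def attended_pairs(seq_len, index_topk=None):
    """Query/key pairs scored by the main attention BMM per sequence."""
    if index_topk is None:
        # Dense causal attention: roughly half of the S x S score matrix.
        return 0.5 * seq_len**2
    # Sparse (DSA) attention: each query attends to index_topk positions.
    return seq_len * index_topk


dense = attended_pairs(8192)
sparse = attended_pairs(8192, index_topk=2048)
print(sparse / dense)  # 0.5
```

The sparse term only wins when `index_topk` is below half the sequence length, and the indexer still pays for a full S^2 BMM of its own.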

```python
nemo_automodel.components.utils.flops_utils._mla_moe_model_flops(
    gbs,
    seq_len,
    hs,
    layers,
    attention_heads,
    vocab_size,
    q_lora_rank,
    kv_lora_rank,
    qk_rope_head_dim,
    qk_nope_head_dim,
    v_head_dim,
    dense_ffn_hs,
    moe_ffn_hs,
    moe_router_topk,
    moe_shared_expert_hs,
    moe_layer_pattern,
    mtp_num_layers = 0,
    index_topk = None,
    index_n_heads = 0,
    index_head_dim = 0
)
```

FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).

**Parameters:**

* `moe_layer_pattern`: List of 0/1 per layer (0=dense, 1=MoE).
* `moe_shared_expert_hs`: Total intermediate size for all shared experts combined.
* `index_topk`: If set, use DSA sparse attention with this many selected positions.
* `index_n_heads`: Number of heads in the DSA indexer.
* `index_head_dim`: Head dimension of the DSA indexer.

```python
nemo_automodel.components.utils.flops_utils._nemotronh_mlp_layer_flops(
    config,
    gbs,
    seq_len
)
```

Model FLOPs for MLP layer. Assume gated linear unit.

```python
nemo_automodel.components.utils.flops_utils._nemotronh_moe_layer_flops(
    config,
    gbs,
    seq_len
)
```

Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).

Nemotron V3 uses relu2 (non-gated) for both routed and shared experts,
so each expert has 2 linear projections (up\_proj + down\_proj), not 3.

When moe\_latent\_size is set (Super V3), routed experts operate in a reduced
latent space with additional projection layers (fc1\_latent\_proj, fc2\_latent\_proj).
The shared expert and gate always operate in the full hidden\_size dimension.
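
The difference between 2 and 3 projections per expert can be sketched as follows. `expert_mlp_flops` is a hypothetical helper; the factor of 6 (fwd+bwd times FMA) is assumed to apply to these GEMMs, and the latent projections and router are ignored.

```python
def expert_mlp_flops(tokens, hidden_size, ffn_hidden_size, gated):
    """Training FLOPs for one expert MLP over `tokens` routed tokens."""
    # Gated (e.g. SwiGLU): gate_proj + up_proj + down_proj = 3 GEMMs.
    # Non-gated (e.g. relu2): up_proj + down_proj = 2 GEMMs.
    num_projections = 3 if gated else 2
    return 6 * num_projections * tokens * hidden_size * ffn_hidden_size


non_gated = expert_mlp_flops(1024, 4096, 1024, gated=False)
gated = expert_mlp_flops(1024, 4096, 1024, gated=True)
print(non_gated / gated)  # non-gated experts cost two thirds of gated ones
```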

```python
nemo_automodel.components.utils.flops_utils._nemotronh_mtp_flops(
    config,
    gbs,
    seq_len,
    num_mtp_layers,
    mtp_block_types,
    use_repeated_layer
)
```

Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra.

The head predicts `num_mtp_layers` (N) additional tokens. Each of the N depths runs
the MTP block pattern once, plus a depth-fusion projection (`eh_proj`: cat\[enorm(embed),
hnorm(hidden)] of size 2\*hidden -> hidden) and a vocab projection (weight-tied lm\_head).

Repeated layer: when `use_repeated_layer` is True the model builds a SINGLE physical
depth (`mtp_block_types` lists its sublayers) and reuses it across all N depths -- but
it still EXECUTES once per depth, so the block compute is N x the physical sublayers.
That N x is the repeated layer's FLOPs. When False, `mtp_block_types` already spans all
N physical depths, so the block runs once.
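
The counting rule for the repeated layer can be sketched as below. `mtp_block_executions` is a hypothetical helper that only counts sublayer executions; it does not compute FLOPs and is not part of the module.

```python
def mtp_block_executions(num_mtp_layers, mtp_block_types, use_repeated_layer):
    """Number of MTP sublayer executions per forward pass."""
    # With a repeated layer, the single physical depth runs once per depth.
    # Otherwise mtp_block_types already lists every physical depth.
    repeats = num_mtp_layers if use_repeated_layer else 1
    return repeats * len(mtp_block_types)


# One physical depth of two sublayers, reused across N = 2 depths.
print(mtp_block_executions(2, ["attention", "moe"], True))  # 4
# Two physical depths listed explicitly, each run once.
print(mtp_block_executions(2, ["attention", "moe", "attention", "moe"], False))  # 4
```

Both configurations execute the same number of sublayers, which is why the repeated layer's FLOPs scale with N even though only one depth exists physically.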

These settings are NOT fully recoverable from the HF config (it retains only the physical
depth count and omits the block pattern), so callers pass the effective values read from
the built model: `num_mtp_layers = model.mtp_config.num_layers` and
`mtp_block_types = [s.block_type for s in model.mtp.layers]`.

**Parameters:**

* `num_mtp_layers`: Effective number of MTP depths actually run (`model.mtp_config.num_layers`).
* `mtp_block_types`: Block types ("mamba"/"attention"/"mlp"/"moe") of the physical MTP
  sublayers (`model.mtp.layers`).
* `use_repeated_layer`: True if the physical depth is reused across the N depths.

```python
nemo_automodel.components.utils.flops_utils._non_mla_attn_layer_flops(
    config,
    gbs,
    seq_len
)
```

Model FLOPs for a non-MLA attention layer.

```python
nemo_automodel.components.utils.flops_utils.attention_flops_calculator(
    seqlen,
    hidden_size,
    num_attention_heads,
    num_query_groups,
    kv_channels: typing.Optional[int] = None,
    is_swa: bool = False,
    swa_window_size: int = 128
)
```

Calculate the FLOPs for the attention part.

```python
nemo_automodel.components.utils.flops_utils.bert_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for BERT family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.calculate_mfu(
    tflops,
    world_size,
    time_seconds,
    reference_mfu = 1979.0
)
```

Calculate Model FLOPs Utilization (MFU).

**Parameters:**

* `tflops`: TFLOPs per GPU
* `world_size`: Total number of GPUs
* `time_seconds`: Time taken for computation, in seconds
* `reference_mfu`: Peak TFLOPs of the hardware (default: H100)

**Returns:**

MFU as a percentage
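
A minimal sketch of an MFU calculation is shown below. It is not the body of `calculate_mfu`: the helper `mfu_percent` and the way its inputs combine (total work across all GPUs, divided by GPU count and elapsed time, then by the per-GPU peak) are assumptions for illustration.

```python
def mfu_percent(total_tflops, world_size, time_seconds, peak_tflops_per_gpu=1979.0):
    """MFU in percent, assuming `total_tflops` is the work summed over all GPUs."""
    # Achieved throughput per GPU, in TFLOPs per second.
    achieved_per_gpu = total_tflops / (world_size * time_seconds)
    return 100.0 * achieved_per_gpu / peak_tflops_per_gpu


# 15832 TFLOPs of work on 8 GPUs in 2 seconds is 989.5 TFLOPs/s per GPU.
print(mfu_percent(15832.0, world_size=8, time_seconds=2.0))  # 50.0
```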

```python
nemo_automodel.components.utils.flops_utils.clip_vit_l_flops(
    config
)
```

Model FLOPs for CLIP ViT

```python
nemo_automodel.components.utils.flops_utils.deepseekv3_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.flux_flops(
    config
)
```

Model FLOPs for FLUX

```python
nemo_automodel.components.utils.flops_utils.get_flops_formula_for_hf_config(
    config: typing.Any
) -> typing.Optional[typing.Callable]
```

Get the appropriate FLOPs formula function for a given HuggingFace config.

**Parameters:**

* `config`: HuggingFace model config object

**Returns:** `Optional[Callable]`

The appropriate FLOPs formula function, or None if the model type is not supported.
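
Because the lookup can return None, callers need to handle the unsupported case. The sketch below shows one way to do that. `estimate_flops` and `toy_lookup` are hypothetical; the lookup is passed in as an argument so the example runs without the package installed, and it assumes the returned formula accepts `gbs` and `seq_len` keywords like most formulas on this page.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def estimate_flops(config: Any, lookup: Callable[[Any], Optional[Callable]],
                   gbs: int = 1, seq_len: Optional[int] = None) -> Optional[float]:
    """Return the FLOPs estimate, or None when no formula is registered."""
    formula = lookup(config)
    if formula is None:
        return None
    return formula(config, gbs=gbs, seq_len=seq_len)


def toy_lookup(config):
    # Stand-in for get_flops_formula_for_hf_config, for illustration only.
    if getattr(config, "model_type", None) == "toy":
        return lambda config, gbs=1, seq_len=None: 6 * gbs * seq_len * config.num_params
    return None


toy = SimpleNamespace(model_type="toy", num_params=10)
print(estimate_flops(toy, toy_lookup, gbs=2, seq_len=4))  # 480
print(estimate_flops(SimpleNamespace(model_type="other"), toy_lookup))  # None
```

In real use, `get_flops_formula_for_hf_config` would be passed as `lookup`.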

```python
nemo_automodel.components.utils.flops_utils.glm4_moe_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Estimate FLOPs for GLM4 MoE model configurations.

```python
nemo_automodel.components.utils.flops_utils.gpt3_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.gpt_oss_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for GPT-OSS

```python
nemo_automodel.components.utils.flops_utils.gpt_oss_flops_calculator(
    gbs,
    num_layers,
    seqlen,
    hidden_size,
    num_attention_heads,
    num_query_groups,
    moe_ffn_hidden_size,
    moe_router_topk,
    vocab_size,
    kv_channels: typing.Optional[int] = None,
    swa_window_size: int = 128,
    window_attn_skip_freq: typing.Optional[int] = 2
)
```

Calculate the FLOPs for the GPT-OSS model

```python
nemo_automodel.components.utils.flops_utils.llama2_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for llama2 family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.llama3_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for llama3 family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.loss_flops_calculator(
    seqlen,
    hidden_size,
    vocab_size
)
```

Calculate the FLOPs for the loss

```python
nemo_automodel.components.utils.flops_utils.minimax_m2_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.

Architecture: GQA attention (Q/K/V/O separate projections, head\_dim may differ from
hidden\_size // num\_heads) + MoE with SwiGLU (no shared experts by default).
Optionally includes MTP (Multi-Token Prediction) modules gated by use\_mtp.

```python
nemo_automodel.components.utils.flops_utils.mixtral_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for mixtral family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.mla_moe_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).

Handles VL wrappers by extracting text\_config if present.

```python
nemo_automodel.components.utils.flops_utils.moe_mlp_flops_calculator(
    seqlen,
    hidden_size,
    moe_ffn_hidden_size,
    moe_router_topk,
    gated_linear_unit: bool = True
)
```

Calculate the FLOPs for the MLP

```python
nemo_automodel.components.utils.flops_utils.nemotron_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for nemotron family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.nemotronh_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for NemotronH

```python
nemo_automodel.components.utils.flops_utils.neva_projection_flops(
    config
)
```

Model FLOPs for NeVA Projection

```python
nemo_automodel.components.utils.flops_utils.qwen3_5_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.

Qwen3.5 uses a hybrid attention pattern: 75% GDN (linear attention) layers
and 25% standard GQA (full attention) layers (full\_attention\_interval=4).
Supports both the MoE variant (Qwen3.5-35B-A3B) and Dense variant (Qwen3.5-27B).
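
The 75% / 25% split follows directly from `full_attention_interval=4`, as the sketch below shows. The helper `hybrid_layer_types`, the type labels, and the choice to place the full-attention layer last in each group of four are assumptions for illustration.

```python
def hybrid_layer_types(num_layers, full_attention_interval=4):
    """Label each layer as GDN (linear) or full attention."""
    return [
        # Assumed placement: every interval-th layer uses full attention.
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = hybrid_layer_types(8)
print(types.count("linear_attention") / len(types))  # 0.75
print(types.count("full_attention") / len(types))    # 0.25
```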

```python
nemo_automodel.components.utils.flops_utils.qwen3_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config

```python
nemo_automodel.components.utils.flops_utils.step3_5_flash_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).

Architecture: hybrid full/SWA attention with different head counts per type,
MoE with shared expert on most layers, first few layers dense, SwiGLU.

```python
nemo_automodel.components.utils.flops_utils.transformer_flops(
    config,
    gbs = 1,
    seq_len = None
)
```

Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config.

Note: This does not cover encoder-decoder models.