nemo_automodel.components.utils.flops_utils

Module Contents

Functions

| Name | Description |
| --- | --- |
| _build_moe_layer_pattern | Build a list of 0/1 indicating dense(0) vs MoE(1) per layer. |
| _gdn_attention_per_layer_flops | FLOPs for a single Gated DeltaNet (GDN / linear attention) layer. |
| _hybrid_model_flops | Model FLOPs for hybrid model |
| _mamba_layer_flops | Model FLOPs for Mamba layer. |
| _mla_attention_per_layer_flops | Per-layer FLOPs for Multi-Latent Attention (MLA). |
| _mla_moe_model_flops | FLOPs for MLA + MoE transformer models (DeepSeek-V3 style). |
| _nemotronh_mlp_layer_flops | Model FLOPs for MLP layer. Assume gated linear unit. |
| _nemotronh_moe_layer_flops | Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE). |
| _nemotronh_mtp_flops | Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra. |
| _non_mla_attn_layer_flops | Model FLOPs for attention layer |
| attention_flops_calculator | Calculate the flops for the attention part. |
| bert_flops | Model FLOPs for BERT family - accepts either AutoConfig or normalized config |
| calculate_mfu | Calculate Model FLOPs Utilization (MFU). |
| clip_vit_l_flops | Model FLOPs for CLIP ViT |
| deepseekv3_flops | Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config |
| flux_flops | Model FLOPs for FLUX |
| get_flops_formula_for_hf_config | Get the appropriate FLOPs formula function for a given HuggingFace config. |
| glm4_moe_flops | Estimate FLOPs for GLM4 MoE model configurations. |
| gpt3_flops | Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config |
| gpt_oss_flops | Model FLOPs for GPT-OSS |
| gpt_oss_flops_calculator | Calculate the flops for the GPT-OSS model |
| llama2_flops | Model FLOPs for llama2 family - accepts either AutoConfig or normalized config |
| llama3_flops | Model FLOPs for llama3 family - accepts either AutoConfig or normalized config |
| loss_flops_calculator | Calculate the flops for the loss |
| minimax_m2_flops | Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config. |
| mixtral_flops | Model FLOPs for mixtral family - accepts either AutoConfig or normalized config |
| mla_moe_flops | Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.). |
| moe_mlp_flops_calculator | Calculate the flops for the MLP |
| nemotron_flops | Model FLOPs for nemotron family - accepts either AutoConfig or normalized config |
| nemotronh_flops | Model FLOPs for NemotronH |
| neva_projection_flops | Model FLOPs for NeVA Projection |
| qwen3_5_flops | Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention. |
| qwen3_flops | Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config |
| step3_5_flash_flops | Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE). |
| transformer_flops | Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. |

API

nemo_automodel.components.utils.flops_utils._build_moe_layer_pattern(
config,
layers
)

Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.

Handles multiple config styles: first_k_dense_replace + moe_layer_freq, mlp_layer_types list, etc.
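
To make the first config style concrete, here is a minimal sketch of how a 0/1 pattern could be derived from first_k_dense_replace and moe_layer_freq. The helper name sketch_moe_layer_pattern and the exact rule (the first k layers are dense, afterwards a layer is MoE when its index is a multiple of moe_layer_freq) are assumptions modeled on DeepSeek-style configs; the library's own function may differ, and the mlp_layer_types style is not covered here.

```python
def sketch_moe_layer_pattern(num_layers, first_k_dense_replace=0, moe_layer_freq=1):
    """Hypothetical helper: return one 0 (dense) / 1 (MoE) entry per layer."""
    pattern = []
    for i in range(num_layers):
        # Assumed rule: layers before first_k_dense_replace stay dense;
        # after that, a layer is MoE when its index is a multiple of moe_layer_freq.
        is_moe = i >= first_k_dense_replace and i % moe_layer_freq == 0
        pattern.append(1 if is_moe else 0)
    return pattern


pattern = sketch_moe_layer_pattern(num_layers=6, first_k_dense_replace=2, moe_layer_freq=1)
print(pattern)  # [0, 0, 1, 1, 1, 1]
```

A list in this shape matches what the moe_layer_pattern parameter of _mla_moe_model_flops is documented to expect (0=dense, 1=MoE).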

nemo_automodel.components.utils.flops_utils._gdn_attention_per_layer_flops(
gbs,
seq_len,
hidden_size,
linear_key_head_dim,
linear_value_head_dim,
linear_num_key_heads,
linear_num_value_heads,
linear_conv_kernel_dim
)

FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.

Based on the GDN FLOPs calculator from Megatron-Bridge PR #2925.

nemo_automodel.components.utils.flops_utils._hybrid_model_flops(
config,
gbs,
seq_len
)

Model FLOPs for hybrid model

nemo_automodel.components.utils.flops_utils._mamba_layer_flops(
config,
gbs,
seq_len
)

Model FLOPs for Mamba layer.

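GEMM FLOPs for in_proj/out_proj are multiplied by 6 (3x for forward plus backward, times 2x because each fused multiply-add counts as two operations). The scan is multiplied by 7 * 3 = 21, since it is a non-GEMM kernel with a higher op count per element.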
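
The two multipliers can be spelled out as plain arithmetic. The sketch below is illustrative only: gemm_train_flops is a hypothetical helper, and reading the 7 as a per-element op count and the 3 as the forward-plus-backward factor is an interpretation of the docstring, not something confirmed by the source.

```python
# 3x (forward + backward) * 2x (one multiply-add counts as two operations)
GEMM_FACTOR = 3 * 2
# Assumed reading: 7 ops per scan element * 3x (forward + backward)
SCAN_FACTOR = 7 * 3


def gemm_train_flops(tokens, in_features, out_features):
    """Hypothetical helper: training FLOPs of one linear projection over `tokens` tokens."""
    return GEMM_FACTOR * tokens * in_features * out_features


print(GEMM_FACTOR, SCAN_FACTOR)  # 6 21
print(gemm_train_flops(tokens=2, in_features=3, out_features=4))  # 144
```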

nemo_automodel.components.utils.flops_utils._mla_attention_per_layer_flops(
gbs,
seq_len,
hs,
attention_heads,
q_lora_rank,
kv_lora_rank,
qk_rope_head_dim,
qk_nope_head_dim,
v_head_dim,
index_topk = None,
index_n_heads = 0,
index_head_dim = 0
)

Per-layer FLOPs for Multi-Latent Attention (MLA).

Shared by DeepSeek V3, Kimi K2.5, Mistral Small 4, GLM-5, etc.

When index_topk is set (DSA / sparse attention), accounts for:

  • Sparse main attention BMM: S * index_topk instead of 0.5 * S^2
  • DSA indexer overhead: Q/K/weights projections + full S^2 indexer BMM
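
The effect of index_topk on the main attention BMM can be illustrated by counting scored query-key pairs per sequence. This is a sketch under stated assumptions: attended_pairs_per_sequence is a hypothetical helper, it covers only the S * index_topk versus 0.5 * S^2 term named above, and it leaves out the indexer overhead and all projection costs.

```python
def attended_pairs_per_sequence(seq_len, index_topk=None):
    """Hypothetical helper: query-key pairs scored by the main attention BMM."""
    if index_topk is None:
        # Dense causal attention: about half of the S x S score matrix.
        return 0.5 * seq_len**2
    # DSA sparse attention: each query attends to index_topk selected positions.
    return seq_len * index_topk


dense = attended_pairs_per_sequence(8192)
sparse = attended_pairs_per_sequence(8192, index_topk=2048)
print(sparse / dense)  # 0.5
```

Because the indexer itself still performs a full S^2 BMM, the end-to-end saving is smaller than this ratio suggests.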

nemo_automodel.components.utils.flops_utils._mla_moe_model_flops(
gbs,
seq_len,
hs,
layers,
attention_heads,
vocab_size,
q_lora_rank,
kv_lora_rank,
qk_rope_head_dim,
qk_nope_head_dim,
v_head_dim,
dense_ffn_hs,
moe_ffn_hs,
moe_router_topk,
moe_shared_expert_hs,
moe_layer_pattern,
mtp_num_layers = 0,
index_topk = None,
index_n_heads = 0,
index_head_dim = 0
)

FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).

Parameters:

moe_layer_pattern

List of 0/1 per layer (0=dense, 1=MoE).

moe_shared_expert_hs

Total intermediate size for all shared experts combined.

index_topk
Defaults to None

If set, use DSA sparse attention with this many selected positions.

index_n_heads
Defaults to 0

Number of heads in the DSA indexer.

index_head_dim
Defaults to 0

Head dimension of the DSA indexer.

nemo_automodel.components.utils.flops_utils._nemotronh_mlp_layer_flops(
config,
gbs,
seq_len
)

Model FLOPs for MLP layer. Assume gated linear unit.

nemo_automodel.components.utils.flops_utils._nemotronh_moe_layer_flops(
config,
gbs,
seq_len
)

Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).

Nemotron V3 uses relu2 (non-gated) for both routed and shared experts, so each expert has 2 linear projections (up_proj + down_proj), not 3.

When moe_latent_size is set (Super V3), routed experts operate in a reduced latent space with additional projection layers (fc1_latent_proj, fc2_latent_proj). The shared expert and gate always operate in the full hidden_size dimension.
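
The difference between a non-gated (2-projection) and a gated (3-projection) expert can be shown with a small calculation. This is a rough sketch: expert_mlp_train_flops is a hypothetical helper, the factor of 6 per GEMM is an assumption (forward plus backward, two operations per multiply-add), and the latent projections, router, and shared expert are not modeled.

```python
def expert_mlp_train_flops(tokens, hidden_size, ffn_hidden_size, gated):
    """Hypothetical helper: training FLOPs of one expert MLP over `tokens` tokens."""
    # Gated (e.g. SwiGLU): gate_proj + up_proj + down_proj.
    # Non-gated (e.g. relu2): up_proj + down_proj only.
    num_projections = 3 if gated else 2
    return 6 * tokens * hidden_size * ffn_hidden_size * num_projections


non_gated = expert_mlp_train_flops(1024, 4096, 2048, gated=False)
gated = expert_mlp_train_flops(1024, 4096, 2048, gated=True)
print(non_gated * 3 == gated * 2)  # True: the non-gated expert costs 2/3 of the gated one
```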

nemo_automodel.components.utils.flops_utils._nemotronh_mtp_flops(
config,
gbs,
seq_len,
num_mtp_layers,
mtp_block_types,
use_repeated_layer
)

Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra.

The head predicts num_mtp_layers (N) additional tokens. Each of the N depths runs the MTP block pattern once, plus a depth-fusion projection (eh_proj: cat[enorm(embed), hnorm(hidden)] of size 2*hidden -> hidden) and a vocab projection (weight-tied lm_head).

Repeated layer: when use_repeated_layer is True, the model builds a single physical depth (mtp_block_types lists its sublayers) and reuses it across all N depths. That depth still executes once per logical depth, so the block compute is N times the cost of the physical sublayers. When False, mtp_block_types already spans all N physical depths, so the block pattern is counted once.
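
The counting rule for the repeated layer can be sketched by listing which sublayers execute across all depths. The helper mtp_block_executions is hypothetical and only mirrors the description above; it does not compute FLOPs and ignores the eh_proj and vocab projections.

```python
def mtp_block_executions(num_mtp_layers, mtp_block_types, use_repeated_layer):
    """Hypothetical helper: sublayer types executed across all MTP depths."""
    if use_repeated_layer:
        # One physical depth, executed once per logical depth.
        return list(mtp_block_types) * num_mtp_layers
    # mtp_block_types already covers every physical depth.
    return list(mtp_block_types)


repeated = mtp_block_executions(2, ["attention", "moe"], use_repeated_layer=True)
unrolled = mtp_block_executions(
    2, ["attention", "moe", "attention", "moe"], use_repeated_layer=False
)
print(repeated)  # ['attention', 'moe', 'attention', 'moe']
print(repeated == unrolled)  # True
```

Both configurations execute the same sequence of sublayers, which is why their block compute should match.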

These settings are NOT fully recoverable from the HF config (it retains only the physical depth count and omits the block pattern), so callers pass the effective values read from the built model: num_mtp_layers = model.mtp_config.num_layers and mtp_block_types = [s.block_type for s in model.mtp.layers].

Parameters:

num_mtp_layers

Effective number of MTP depths actually run (model.mtp_config.num_layers).

mtp_block_types

Block types ("mamba"/"attention"/"mlp"/"moe") of the physical MTP sublayers (model.mtp.layers).

use_repeated_layer

True if the physical depth is reused across the N depths.

nemo_automodel.components.utils.flops_utils._non_mla_attn_layer_flops(
config,
gbs,
seq_len
)

Model FLOPs for attention layer

nemo_automodel.components.utils.flops_utils.attention_flops_calculator(
seqlen,
hidden_size,
num_attention_heads,
num_query_groups,
kv_channels: typing.Optional[int] = None,
is_swa: bool = False,
swa_window_size: int = 128
)

Calculate the flops for the attention part.

nemo_automodel.components.utils.flops_utils.bert_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for BERT family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.calculate_mfu(
tflops,
world_size,
time_seconds,
reference_mfu = 1979.0
)

Calculate Model FLOPs Utilization (MFU).

Parameters:

tflops

TFLOPs per GPU

world_size

Total number of GPUs

time_seconds

Time taken for computation

reference_mfu
Defaults to 1979.0

Peak TFLOPs of the hardware (default: H100)

Returns:

MFU as a percentage
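
For orientation, MFU is commonly expressed as achieved per-GPU throughput divided by peak hardware throughput. The sketch below uses a hypothetical function, sketch_mfu, with its own input convention (total model FLOPs for the step, summed over all GPUs); that convention is an assumption and may not match how calculate_mfu interprets its tflops argument.

```python
def sketch_mfu(total_model_flops, world_size, time_seconds, peak_tflops_per_gpu=1979.0):
    """Hypothetical helper: MFU as a percentage of peak per-GPU throughput."""
    # Achieved throughput per GPU, in TFLOP/s.
    achieved_tflops_per_gpu = total_model_flops / 1e12 / world_size / time_seconds
    return 100.0 * achieved_tflops_per_gpu / peak_tflops_per_gpu


# Illustrative numbers only: 8 GPUs each sustaining 989.5 TFLOP/s for 2 seconds.
total_flops = 8 * 2 * 989.5e12
print(sketch_mfu(total_flops, world_size=8, time_seconds=2.0))  # 50.0
```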

nemo_automodel.components.utils.flops_utils.clip_vit_l_flops(
config
)

Model FLOPs for CLIP ViT

nemo_automodel.components.utils.flops_utils.deepseekv3_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.flux_flops(
config
)

Model FLOPs for FLUX

nemo_automodel.components.utils.flops_utils.get_flops_formula_for_hf_config(
config: typing.Any
) -> typing.Optional[typing.Callable]

Get the appropriate FLOPs formula function for a given HuggingFace config.

Parameters:

config
Any

HuggingFace model config object

Returns: Optional[Callable]

The appropriate FLOPs formula function, or None if model type is not supported

nemo_automodel.components.utils.flops_utils.glm4_moe_flops(
config,
gbs = 1,
seq_len = None
)

Estimate FLOPs for GLM4 MoE model configurations.

nemo_automodel.components.utils.flops_utils.gpt3_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.gpt_oss_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for GPT-OSS

nemo_automodel.components.utils.flops_utils.gpt_oss_flops_calculator(
gbs,
num_layers,
seqlen,
hidden_size,
num_attention_heads,
num_query_groups,
moe_ffn_hidden_size,
moe_router_topk,
vocab_size,
kv_channels: typing.Optional[int] = None,
swa_window_size: int = 128,
window_attn_skip_freq: typing.Optional[int] = 2
)

Calculate the flops for the GPT-OSS model

nemo_automodel.components.utils.flops_utils.llama2_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for llama2 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.llama3_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for llama3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.loss_flops_calculator(
seqlen,
hidden_size,
vocab_size
)

Calculate the flops for the loss

nemo_automodel.components.utils.flops_utils.minimax_m2_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.

Architecture: GQA attention (Q/K/V/O separate projections, head_dim may differ from hidden_size // num_heads) + MoE with SwiGLU (no shared experts by default). Optionally includes MTP (Multi-Token Prediction) modules gated by use_mtp.

nemo_automodel.components.utils.flops_utils.mixtral_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for mixtral family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.mla_moe_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).

Handles VL wrappers by extracting text_config if present.

nemo_automodel.components.utils.flops_utils.moe_mlp_flops_calculator(
seqlen,
hidden_size,
moe_ffn_hidden_size,
moe_router_topk,
gated_linear_unit: bool = True
)

Calculate the flops for the MLP

nemo_automodel.components.utils.flops_utils.nemotron_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for nemotron family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.nemotronh_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for NemotronH

nemo_automodel.components.utils.flops_utils.neva_projection_flops(
config
)

Model FLOPs for NeVA Projection

nemo_automodel.components.utils.flops_utils.qwen3_5_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.

Qwen3.5 uses a hybrid attention pattern: 75% GDN (linear attention) layers and 25% standard GQA (full attention) layers (full_attention_interval=4). Supports both the MoE variant (Qwen3.5-35B-A3B) and Dense variant (Qwen3.5-27B).
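
The 75%/25% split follows from full_attention_interval=4 if every fourth layer uses full attention. The sketch below assumes that placement rule (layer i is full attention when (i + 1) is a multiple of the interval); sketch_layer_types and the "linear_attention"/"full_attention" labels are illustrative names, not taken from the source.

```python
def sketch_layer_types(num_layers, full_attention_interval=4):
    """Hypothetical helper: label each layer as GDN (linear) or full attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = sketch_layer_types(8)
print(types[:4])  # ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
print(types.count("linear_attention") / len(types))  # 0.75
```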

nemo_automodel.components.utils.flops_utils.qwen3_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.step3_5_flash_flops(
config,
gbs = 1,
seq_len = None
)

Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).

Architecture: hybrid full/SWA attention with different head counts per type, MoE with shared expert on most layers, first few layers dense, SwiGLU.

nemo_automodel.components.utils.flops_utils.transformer_flops(
config,
gbs = 1,
seq_len = None
)

Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. Note: This does not cover encoder-decoder models.