nemo_automodel.components.utils.flops_utils#

Module Contents#

Functions#

calculate_mfu

Calculate Model FLOPs Utilization (MFU).

gpt3_flops

Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config

llama2_flops

Model FLOPs for llama2 family - accepts either AutoConfig or normalized config

llama3_flops

Model FLOPs for llama3 family - accepts either AutoConfig or normalized config

nemotron_flops

Model FLOPs for nemotron family - accepts either AutoConfig or normalized config

mixtral_flops

Model FLOPs for mixtral family - accepts either AutoConfig or normalized config

qwen3_flops

Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config

bert_flops

Model FLOPs for BERT family - accepts either AutoConfig or normalized config

transformer_flops

Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. Note: This does not cover encoder-decoder models.

clip_vit_l_flops

Model FLOPs for CLIP ViT

neva_projection_flops

Model FLOPs for NeVA Projection

flux_flops

Model FLOPs for FLUX

deepseekv3_flops

Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config

_nemotronh_mlp_layer_flops

Model FLOPs for MLP layer. Assumes a gated linear unit.

_nemotronh_moe_layer_flops

Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).

_non_mla_attn_layer_flops

Model FLOPs for attention layer

_mamba_layer_flops

Model FLOPs for Mamba layer.

_hybrid_model_flops

Model FLOPs for hybrid model

nemotronh_flops

Model FLOPs for NemotronH

attention_flops_calculator

Calculate the FLOPs for the attention part.

moe_mlp_flops_calculator

Calculate the FLOPs for the MLP.

loss_flops_calculator

Calculate the FLOPs for the loss.

gpt_oss_flops_calculator

Calculate the FLOPs for the GPT-OSS model.

gpt_oss_flops

Model FLOPs for GPT-OSS

glm4_moe_flops

Model FLOPs for GLM-4 MoE family - accepts either AutoConfig or normalized config

minimax_m2_flops

Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.

_gdn_attention_per_layer_flops

FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.

qwen3_5_flops

Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.

_mla_attention_per_layer_flops

Per-layer FLOPs for Multi-Latent Attention (MLA).

_mla_moe_model_flops

FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).

_build_moe_layer_pattern

Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.

mla_moe_flops

Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).

step3_5_flash_flops

Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).

get_flops_formula_for_hf_config

Get the appropriate FLOPs formula function for a given HuggingFace config.

API#

nemo_automodel.components.utils.flops_utils.calculate_mfu(tflops, world_size, time_seconds, reference_mfu=1979.0)[source]#

Calculate Model FLOPs Utilization (MFU).

Parameters:
  • tflops – TFLOPs per GPU

  • world_size – Total number of GPUs

  • time_seconds – Time taken for the computation, in seconds

  • reference_mfu – Peak TFLOPs of the hardware (default: H100)

Returns:

MFU as a percentage
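
The definition sketched below is the standard one (achieved throughput per GPU over hardware peak); a minimal stand-alone sketch, assuming the input is the workload's total TFLOPs spread evenly over `world_size` GPUs (the library's exact normalization may differ):

```python
def calculate_mfu_sketch(total_model_tflops, world_size, time_seconds,
                         reference_mfu=1979.0):
    """Illustrative MFU: achieved TFLOP/s per GPU divided by hardware peak.

    1979 is the H100 BF16 peak TFLOPs (with sparsity). Not the library's
    exact code, just the textbook definition.
    """
    per_gpu_tflops = total_model_tflops / world_size  # TFLOPs done per GPU
    achieved_rate = per_gpu_tflops / time_seconds     # TFLOP/s per GPU
    return achieved_rate / reference_mfu * 100.0      # percent
```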

nemo_automodel.components.utils.flops_utils.gpt3_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.llama2_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for llama2 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.llama3_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for llama3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.nemotron_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for nemotron family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.mixtral_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for mixtral family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.qwen3_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.bert_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for BERT family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.transformer_flops(config, gbs=1, seq_len=None)[source]#

Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. Note: This does not cover encoder-decoder models.
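
For orientation, the classic dense decoder-only training estimate (3x for forward plus backward, 2 FLOPs per multiply-accumulate, 4x MLP width) can be sketched stand-alone; the parameter names and the `ffn_mult=4` default are assumptions of this sketch, not the library's exact formula:

```python
def transformer_flops_sketch(num_layers, hidden_size, vocab_size, gbs, seq_len,
                             ffn_mult=4):
    """Textbook training-FLOPs estimate for a dense decoder-only Transformer."""
    tokens = gbs * seq_len
    # per-layer GEMMs: QKV + output projections (4*h*h), attention score and
    # context BMMs (2*s*h per token), and the MLP (2*ffn_mult*h*h)
    per_layer = 2 * tokens * (4 * hidden_size**2 + 2 * seq_len * hidden_size
                              + 2 * ffn_mult * hidden_size**2)
    logits = 2 * tokens * hidden_size * vocab_size  # vocabulary projection
    return 3 * (num_layers * per_layer + logits)    # 3x: forward + backward
```

With `ffn_mult=4` this reduces to the familiar 72 * gbs * s * L * h^2 * (1 + s/(6h) + V/(12Lh)) closed form.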

nemo_automodel.components.utils.flops_utils.clip_vit_l_flops(config)[source]#

Model FLOPs for CLIP ViT

nemo_automodel.components.utils.flops_utils.neva_projection_flops(config)[source]#

Model FLOPs for NeVA Projection

nemo_automodel.components.utils.flops_utils.flux_flops(config)[source]#

Model FLOPs for FLUX

nemo_automodel.components.utils.flops_utils.deepseekv3_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils._nemotronh_mlp_layer_flops(config, gbs, seq_len)[source]#

Model FLOPs for MLP layer. Assumes a gated linear unit.

nemo_automodel.components.utils.flops_utils._nemotronh_moe_layer_flops(config, gbs, seq_len)[source]#

Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).

Nemotron V3 uses relu2 (non-gated) for both routed and shared experts, so each expert has 2 linear projections (up_proj + down_proj), not 3.

When moe_latent_size is set (Super V3), routed experts operate in a reduced latent space with additional projection layers (fc1_latent_proj, fc2_latent_proj). The shared expert and gate always operate in the full hidden_size dimension.

Accounts for:

  1. Routed experts: only num_experts_per_tok activated per token.

  2. Shared expert: always active for every token (full hidden_size).

  3. Router/gate: linear projection hidden_size -> n_routed_experts.

  4. Latent projections (if moe_latent_size is set): down and up projections.
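
The four components above can be sketched stand-alone (parameter names follow the description rather than the module's exact config keys; the 6x factor is 3x fwd+bwd times 2x multiply-accumulate):

```python
def nemotronh_moe_layer_flops_sketch(gbs, seq_len, hidden_size, moe_ffn,
                                     num_experts_per_tok, n_routed_experts,
                                     shared_ffn, moe_latent_size=None):
    """Illustrative sketch of the four components; not the library's code.
    relu2 experts are non-gated: 2 projections (up + down) per expert."""
    tokens = gbs * seq_len
    expert_in = moe_latent_size or hidden_size  # routed experts may run in latent space
    routed = 6 * tokens * num_experts_per_tok * 2 * expert_in * moe_ffn
    shared = 6 * tokens * 2 * hidden_size * shared_ffn    # always full hidden_size
    router = 6 * tokens * hidden_size * n_routed_experts  # gate projection
    latent = 0
    if moe_latent_size is not None:
        # fc1_latent_proj (down) + fc2_latent_proj (up)
        latent = 6 * tokens * 2 * hidden_size * moe_latent_size
    return routed + shared + router + latent
```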

nemo_automodel.components.utils.flops_utils._non_mla_attn_layer_flops(config, gbs, seq_len)[source]#

Model FLOPs for attention layer

nemo_automodel.components.utils.flops_utils._mamba_layer_flops(config, gbs, seq_len)[source]#

Model FLOPs for Mamba layer.

Three components:

  • in_proj: input projections (x_proj, z_proj, dt_proj, B_proj, C_proj)

  • scan: SSM scan kernel (7x factor accounts for the full SSD scan cost)

  • out_proj: output projection back to hidden_size

Multiplied by 6 (3x fwd+bwd * 2x FMA) for in_proj/out_proj (standard GEMMs), and 7 * 3 = 21 for scan (non-GEMM kernel, higher op count per element).
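
A rough stand-alone sketch using these multipliers; the dimension names (`d_inner`, `d_state`, `dt_rank`) are generic Mamba conventions assumed for illustration, not this module's parameters:

```python
def mamba_layer_flops_sketch(gbs, seq_len, hidden_size, d_inner, d_state, dt_rank):
    """Illustrative Mamba-layer FLOPs: 6x (3x fwd+bwd * 2x FMA) for the
    projection GEMMs, 21x (7 * 3) for the non-GEMM SSD scan."""
    tokens = gbs * seq_len
    # in_proj: x/z branch projections plus the dt/B/C parameter projections
    in_proj = 6 * tokens * (hidden_size * 2 * d_inner
                            + d_inner * (dt_rank + 2 * d_state))
    scan = 21 * tokens * d_inner * d_state            # SSM scan kernel
    out_proj = 6 * tokens * d_inner * hidden_size     # back to hidden_size
    return in_proj + scan + out_proj
```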

nemo_automodel.components.utils.flops_utils._hybrid_model_flops(config, gbs, seq_len)[source]#

Model FLOPs for hybrid model

nemo_automodel.components.utils.flops_utils.nemotronh_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for NemotronH

nemo_automodel.components.utils.flops_utils.attention_flops_calculator(
seqlen,
hidden_size,
num_attention_heads,
num_query_groups,
kv_channels: Optional[int] = None,
is_swa: bool = False,
swa_window_size: int = 128,
)[source]#

Calculate the FLOPs for the attention part.
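
A stand-alone sketch of the GQA cost this helper covers (forward GEMM FLOPs only; the library's exact constants for causal masking and fwd+bwd scaling may differ):

```python
from typing import Optional

def attention_flops_sketch(seqlen, hidden_size, num_attention_heads,
                           num_query_groups, kv_channels: Optional[int] = None,
                           is_swa: bool = False, swa_window_size: int = 128):
    """Illustrative forward GEMM FLOPs for one GQA attention layer."""
    kv_channels = kv_channels or hidden_size // num_attention_heads
    q_size = num_attention_heads * kv_channels
    kv_size = num_query_groups * kv_channels
    proj = 2 * seqlen * hidden_size * (q_size + 2 * kv_size)  # Q, K, V projections
    proj += 2 * seqlen * q_size * hidden_size                 # output projection
    # score + context BMMs; SWA caps the keys each query attends to
    keys_per_query = min(swa_window_size, seqlen) if is_swa else seqlen
    attn = 2 * 2 * seqlen * keys_per_query * q_size
    if not is_swa:
        attn //= 2  # causal mask: roughly half the score matrix is needed
    return proj + attn
```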

nemo_automodel.components.utils.flops_utils.moe_mlp_flops_calculator(
seqlen,
hidden_size,
moe_ffn_hidden_size,
moe_router_topk,
gated_linear_unit: bool = True,
)[source]#

Calculate the FLOPs for the MLP.
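
A minimal sketch of the routed-expert cost (forward FLOPs per sequence; illustrative, not the library's exact code):

```python
def moe_mlp_flops_sketch(seqlen, hidden_size, moe_ffn_hidden_size,
                         moe_router_topk, gated_linear_unit: bool = True):
    """Each token is processed by its top-k experts; gated (SwiGLU-style)
    experts have 3 projections (gate, up, down), non-gated have 2."""
    n_proj = 3 if gated_linear_unit else 2
    return 2 * seqlen * moe_router_topk * n_proj * hidden_size * moe_ffn_hidden_size
```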

nemo_automodel.components.utils.flops_utils.loss_flops_calculator(seqlen, hidden_size, vocab_size)[source]#

Calculate the FLOPs for the loss.
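
The dominant loss-side cost is the vocabulary-logits projection; a minimal illustrative sketch:

```python
def loss_flops_sketch(seqlen, hidden_size, vocab_size):
    """Logits GEMM hidden_size -> vocab_size, 2 FLOPs per multiply-accumulate
    (forward only; training formulas typically scale this by 3x for fwd+bwd)."""
    return 2 * seqlen * hidden_size * vocab_size
```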

nemo_automodel.components.utils.flops_utils.gpt_oss_flops_calculator(
gbs,
num_layers,
seqlen,
hidden_size,
num_attention_heads,
num_query_groups,
moe_ffn_hidden_size,
moe_router_topk,
vocab_size,
kv_channels: Optional[int] = None,
swa_window_size: int = 128,
window_attn_skip_freq: Optional[int] = 2,
)[source]#

Calculate the FLOPs for the GPT-OSS model.

nemo_automodel.components.utils.flops_utils.gpt_oss_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for GPT-OSS

nemo_automodel.components.utils.flops_utils.glm4_moe_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for GLM-4 MoE family - accepts either AutoConfig or normalized config

nemo_automodel.components.utils.flops_utils.minimax_m2_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.

Architecture: GQA attention (Q/K/V/O separate projections, head_dim may differ from hidden_size // num_heads) + MoE with SwiGLU (no shared experts by default). Optionally includes MTP (Multi-Token Prediction) modules gated by use_mtp.

nemo_automodel.components.utils.flops_utils._gdn_attention_per_layer_flops(
gbs,
seq_len,
hidden_size,
linear_key_head_dim,
linear_value_head_dim,
linear_num_key_heads,
linear_num_value_heads,
linear_conv_kernel_dim,
)[source]#

FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.

Based on the GDN FLOPs calculator from Megatron-Bridge PR #2925.

nemo_automodel.components.utils.flops_utils.qwen3_5_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.

Qwen3.5 uses a hybrid attention pattern: 75% GDN (linear attention) layers and 25% standard GQA (full attention) layers (full_attention_interval=4). Supports both the MoE variant (Qwen3.5-35B-A3B) and Dense variant (Qwen3.5-27B).
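
The hybrid layout can be sketched as follows (an assumed reading of `full_attention_interval`: every fourth layer uses full attention, the rest GDN):

```python
def qwen3_5_layer_types_sketch(num_layers, full_attention_interval=4):
    """Illustrative layer pattern: 1-in-4 full GQA attention, 3-in-4 GDN."""
    return ["full" if (i + 1) % full_attention_interval == 0 else "gdn"
            for i in range(num_layers)]
```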

nemo_automodel.components.utils.flops_utils._mla_attention_per_layer_flops(
gbs,
seq_len,
hs,
attention_heads,
q_lora_rank,
kv_lora_rank,
qk_rope_head_dim,
qk_nope_head_dim,
v_head_dim,
index_topk=None,
index_n_heads=0,
index_head_dim=0,
)[source]#

Per-layer FLOPs for Multi-Latent Attention (MLA).

Shared by DeepSeek V3, Kimi K2.5, Mistral Small 4, GLM-5, etc.

When index_topk is set (DSA / sparse attention), accounts for:

  • Sparse main attention BMM: S * index_topk instead of 0.5 * S^2

  • DSA indexer overhead: Q/K/weights projections + full S^2 indexer BMM
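
The contrast in the first bullet, counted as query-key pairs in the main attention BMM, can be sketched as:

```python
def attn_score_pairs_sketch(seq_len, index_topk=None):
    """Causal dense attention scores ~S^2/2 pairs; DSA sparse attention
    scores at most index_topk selected positions per query."""
    if index_topk is None:
        return seq_len * seq_len // 2
    return seq_len * min(index_topk, seq_len)
```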

nemo_automodel.components.utils.flops_utils._mla_moe_model_flops(
gbs,
seq_len,
hs,
layers,
attention_heads,
vocab_size,
q_lora_rank,
kv_lora_rank,
qk_rope_head_dim,
qk_nope_head_dim,
v_head_dim,
dense_ffn_hs,
moe_ffn_hs,
moe_router_topk,
moe_shared_expert_hs,
moe_layer_pattern,
mtp_num_layers=0,
index_topk=None,
index_n_heads=0,
index_head_dim=0,
)[source]#

FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).

Parameters:
  • moe_layer_pattern – List of 0/1 per layer (0=dense, 1=MoE).

  • moe_shared_expert_hs – Total intermediate size for all shared experts combined.

  • index_topk – If set, use DSA sparse attention with this many selected positions.

  • index_n_heads – Number of heads in the DSA indexer.

  • index_head_dim – Head dimension of the DSA indexer.

nemo_automodel.components.utils.flops_utils._build_moe_layer_pattern(config, layers)[source]#

Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.

Handles multiple config styles: first_k_dense_replace + moe_layer_freq, mlp_layer_types list, etc.
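
The `first_k_dense_replace` + `moe_layer_freq` style can be sketched stand-alone (illustrative; the real helper also handles `mlp_layer_types` lists and other config styles):

```python
def build_moe_layer_pattern_sketch(num_layers, first_k_dense_replace=0,
                                   moe_layer_freq=1):
    """Illustrative: the first k layers are dense (0); afterwards every
    moe_layer_freq-th layer is MoE (1), so freq=1 makes the rest all MoE."""
    pattern = []
    for i in range(num_layers):
        if i < first_k_dense_replace:
            pattern.append(0)  # dense
        else:
            pattern.append(1 if (i - first_k_dense_replace) % moe_layer_freq == 0 else 0)
    return pattern
```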

nemo_automodel.components.utils.flops_utils.mla_moe_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).

Handles VL wrappers by extracting text_config if present.

nemo_automodel.components.utils.flops_utils.step3_5_flash_flops(config, gbs=1, seq_len=None)[source]#

Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).

Architecture: hybrid full/SWA attention with different head counts per type, MoE with shared expert on most layers, first few layers dense, SwiGLU.

nemo_automodel.components.utils.flops_utils.get_flops_formula_for_hf_config(
config: Any,
) Optional[Callable][source]#

Get the appropriate FLOPs formula function for a given HuggingFace config.

Parameters:

config – HuggingFace model config object

Returns:

The appropriate FLOPs formula function, or None if model type is not supported
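
A toy stand-in for the dispatch (the mapping entries are assumptions based on the formulas in this module; the real function returns the callable itself and covers every family listed above):

```python
# Hypothetical mapping from HF `model_type` to a formula name.
_FORMULA_NAME_BY_MODEL_TYPE = {
    "llama": "llama3_flops",
    "mixtral": "mixtral_flops",
    "qwen3": "qwen3_flops",
    "deepseek_v3": "deepseekv3_flops",
    "bert": "bert_flops",
}

def get_flops_formula_name_sketch(model_type: str):
    """Return the matching formula name, or None when unsupported
    (mirroring the real function's Optional[Callable] contract)."""
    return _FORMULA_NAME_BY_MODEL_TYPE.get(model_type)
```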