nemo_automodel.components.utils.flops_utils
Module Contents
Functions
API
Build a list of 0/1 indicating dense(0) vs MoE(1) per layer.
Handles multiple config styles: first_k_dense_replace + moe_layer_freq, mlp_layer_types list, etc.
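A minimal sketch of how such a per-layer pattern could be derived from the two config styles named above. The helper name `build_moe_layer_pattern`, the `"dense"` label inside `mlp_layer_types`, and the default values are assumptions for illustration, not the module's actual implementation.

```python
from types import SimpleNamespace


def build_moe_layer_pattern(config, num_layers):
    """Hypothetical helper: return a list of 0 (dense) / 1 (MoE), one per layer."""
    # Style 1: an explicit per-layer list. The "dense" label is an assumption.
    layer_types = getattr(config, "mlp_layer_types", None)
    if layer_types is not None:
        return [0 if t == "dense" else 1 for t in layer_types[:num_layers]]

    # Style 2: the first k layers are dense, then every moe_layer_freq-th
    # layer index is MoE (defaults here are assumptions).
    first_k = getattr(config, "first_k_dense_replace", 0)
    freq = getattr(config, "moe_layer_freq", 1)
    return [1 if i >= first_k and i % freq == 0 else 0 for i in range(num_layers)]


# Example configs built by hand; real callers would pass a model config object.
deepseek_like = SimpleNamespace(first_k_dense_replace=3, moe_layer_freq=1)
explicit_list = SimpleNamespace(mlp_layer_types=["dense", "sparse", "sparse"])
```

Checking the explicit list first means a config that carries both styles is resolved by its per-layer list, which is the more specific description.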
FLOPs for a single Gated DeltaNet (GDN / linear attention) layer.
Based on the GDN FLOPs calculator from Megatron-Bridge PR #2925.
Model FLOPs for hybrid model
Model FLOPs for Mamba layer.
The in_proj/out_proj terms (standard GEMMs) are multiplied by 6 (3x for forward + backward, 2x for FMA). The scan term (a non-GEMM kernel with a higher op count per element) is multiplied by 7 * 3 = 21.
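The two multipliers can be illustrated with a short sketch. The function names and the choice of tokens * d_inner * state_dim as the scan's element count are assumptions for illustration; only the factors 6 and 21 come from the description above.

```python
FWD_BWD = 3  # forward + backward is counted as 3x the forward pass
FMA = 2      # one fused multiply-add counts as 2 floating-point operations
SCAN_OPS = 7 # forward op count per scan element, as in the 7 * 3 factor


def gemm_training_flops(tokens, in_features, out_features):
    """Hypothetical helper: training FLOPs of one linear projection (factor 6)."""
    return FWD_BWD * FMA * tokens * in_features * out_features


def scan_training_flops(tokens, d_inner, state_dim):
    """Hypothetical helper: training FLOPs of the scan kernel (factor 21).

    The element count tokens * d_inner * state_dim is an assumption.
    """
    return SCAN_OPS * FWD_BWD * tokens * d_inner * state_dim
```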
Per-layer FLOPs for Multi-Latent Attention (MLA).
Shared by DeepSeek V3, Kimi K2.5, Mistral Small 4, GLM-5, etc.
When index_topk is set (DSA / sparse attention), accounts for:
- Sparse main attention BMM: S * index_topk instead of 0.5 * S^2
- DSA indexer overhead: Q/K/weights projections + full S^2 indexer BMM
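The difference between the dense and sparse score terms can be sketched as follows. This covers only the number of query/key pairs in the main attention BMM; the indexer overhead from the second bullet is deliberately left out, and the function name is hypothetical.

```python
def attended_pairs(seq_len, index_topk=None):
    """Hypothetical helper: query/key pairs scored by the main attention BMM.

    Dense causal attention scores about half of the S x S matrix; with DSA
    each query attends to index_topk selected positions instead.
    """
    if index_topk is None:
        return 0.5 * seq_len**2
    return seq_len * index_topk


dense = attended_pairs(8192)
sparse = attended_pairs(8192, index_topk=2048)
```

Under this simplified view the sparse term is smaller than the dense one only when index_topk is below S / 2, which is why the saving grows with sequence length.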
FLOPs for MLA + MoE transformer models (DeepSeek-V3 style).
Parameters:
List of 0/1 per layer (0=dense, 1=MoE).
Total intermediate size for all shared experts combined.
If set, use DSA sparse attention with this many selected positions.
Number of heads in the DSA indexer.
Head dimension of the DSA indexer.
Model FLOPs for MLP layer. Assumes a gated linear unit.
Model FLOPs for a MoE layer in Nemotron V3/Super V3 (hybrid Mamba/Attention/MoE).
Nemotron V3 uses relu2 (non-gated) for both routed and shared experts, so each expert has 2 linear projections (up_proj + down_proj), not 3.
When moe_latent_size is set (Super V3), routed experts operate in a reduced latent space with additional projection layers (fc1_latent_proj, fc2_latent_proj). The shared expert and gate always operate in the full hidden_size dimension.
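A rough sketch of how the non-gated expert shape and the optional latent space change the weight count touched per token. All names are hypothetical, and the sketch assumes the two latent projections are applied once per token rather than once per expert; the shared expert and gate are omitted.

```python
def expert_weight_count(in_size, ffn_size, gated):
    """Hypothetical helper: weights in one expert MLP.

    A gated expert has 3 projections (gate, up, down); a non-gated relu2
    expert has 2 (up_proj, down_proj).
    """
    num_projections = 3 if gated else 2
    return num_projections * in_size * ffn_size


def routed_weights_per_token(hidden_size, ffn_size, top_k, moe_latent_size=None):
    """Hypothetical helper: routed-expert weights touched by one token."""
    if moe_latent_size is None:
        return top_k * expert_weight_count(hidden_size, ffn_size, gated=False)
    # fc1_latent_proj (hidden -> latent) and fc2_latent_proj (latent -> hidden);
    # applying them once per token is an assumption of this sketch.
    latent_projections = 2 * hidden_size * moe_latent_size
    experts = top_k * expert_weight_count(moe_latent_size, ffn_size, gated=False)
    return latent_projections + experts
```

Multiplying such a count by the usual 6 (3x forward + backward, 2x FMA) and by the number of tokens would give a training FLOPs estimate for the routed path.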
Model FLOPs for the Multi-Token-Prediction (MTP) head of Nemotron-3 Super/Ultra.
The head predicts num_mtp_layers (N) additional tokens. Each of the N depths runs
the MTP block pattern once, plus a depth-fusion projection (eh_proj: cat[enorm(embed),
hnorm(hidden)] of size 2*hidden -> hidden) and a vocab projection (weight-tied lm_head).
Repeated layer: when use_repeated_layer is True, the model builds a single physical
depth (mtp_block_types lists its sublayers) and reuses it across all N depths. It
still executes once per depth, so the block compute is N x that of the physical sublayers;
this factor of N accounts for the repeated layer's FLOPs. When False, mtp_block_types
already spans all N physical depths, so the block pattern is counted once.
These settings are NOT fully recoverable from the HF config (it retains only the physical
depth count and omits the block pattern), so callers pass the effective values read from
the built model: num_mtp_layers = model.mtp_config.num_layers and
mtp_block_types = [s.block_type for s in model.mtp.layers].
Parameters:
num_mtp_layers – effective MTP depths actually run (model.mtp_config.num_layers).
mtp_block_types – block types ("mamba"/"attention"/"mlp"/"moe") of the physical MTP sublayers (model.mtp.layers).
use_repeated_layer – True if the physical depth is reused across the N depths.
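The repeated-layer rule described above can be sketched as a function that expands the physical sublayers into the sequence of sublayers actually executed. The function name is hypothetical; the per-depth eh_proj and vocab projections are not modeled here.

```python
def executed_mtp_sublayers(num_mtp_layers, mtp_block_types, use_repeated_layer):
    """Hypothetical helper: block types in the order they are executed.

    With a repeated layer, the single physical depth runs once per MTP depth,
    so its sublayers are counted num_mtp_layers times. Otherwise the list
    already spans every physical depth and is counted once.
    """
    repeats = num_mtp_layers if use_repeated_layer else 1
    return list(mtp_block_types) * repeats


# One physical depth reused across two MTP depths.
repeated = executed_mtp_sublayers(2, ["attention", "moe"], use_repeated_layer=True)
# Two physical depths listed explicitly.
unrolled = executed_mtp_sublayers(
    2, ["attention", "moe", "attention", "moe"], use_repeated_layer=False
)
```

Both configurations execute the same four sublayers, so a per-sublayer FLOPs sum over either result would agree.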
Model FLOPs for attention layer
Calculate the FLOPs for the attention part.
Model FLOPs for BERT family - accepts either AutoConfig or normalized config
Calculate Model FLOPs Utilization (MFU).
Parameters:
TFLOPs per GPU
Total number of GPUs
Time taken for computation
Peak TFLOPs of the hardware (default: H100)
Returns:
MFU as a percentage
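A minimal sketch of an MFU calculation of this shape. It assumes the first argument is the total work in TFLOPs summed over all GPUs and that the peak is passed explicitly per GPU; the library's own argument convention and default peak value may differ, and the function name is hypothetical.

```python
def mfu_percent(total_tflops, num_gpus, seconds, peak_tflops_per_gpu):
    """Hypothetical helper: Model FLOPs Utilization as a percentage.

    Assumes total_tflops is the work done across all GPUs in the timed
    interval; the achieved per-GPU rate is compared with the hardware peak.
    """
    achieved_tflops_per_gpu = total_tflops / (num_gpus * seconds)
    return 100.0 * achieved_tflops_per_gpu / peak_tflops_per_gpu


# Illustrative numbers only: 8000 TFLOPs of work on 8 GPUs in 2 seconds,
# measured against an assumed peak of 1000 TFLOP/s per GPU.
example_mfu = mfu_percent(8000.0, 8, 2.0, 1000.0)
```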
Model FLOPs for CLIP ViT
Model FLOPs for DeepSeek V3 - accepts either AutoConfig or normalized config
Model FLOPs for FLUX
Get the appropriate FLOPs formula function for a given HuggingFace config.
Parameters:
HuggingFace model config object
Returns: Optional[Callable]
The appropriate FLOPs formula function, or None if model type is not supported
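One common way to implement this kind of lookup is a dictionary keyed by the config's model_type. The sketch below is an assumption about the general pattern, not the module's actual dispatch logic; the registry contents, the toy formula, and the `num_params` attribute are hypothetical.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def toy_dense_flops(config, tokens):
    """Hypothetical formula: the common 6 * parameters * tokens approximation."""
    return 6 * config.num_params * tokens


# Hypothetical registry mapping model_type strings to formula functions.
FLOPS_FORMULAS = {"llama": toy_dense_flops}


def lookup_flops_formula(config) -> Optional[Callable]:
    """Return the formula registered for config.model_type, or None."""
    model_type = getattr(config, "model_type", None)
    return FLOPS_FORMULAS.get(model_type)


known = lookup_flops_formula(SimpleNamespace(model_type="llama", num_params=10))
unknown = lookup_flops_formula(SimpleNamespace(model_type="not-registered"))
```

Returning None for unsupported model types lets the caller decide whether to skip FLOPs reporting or fall back to a generic estimate.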
Estimate FLOPs for GLM4 MoE model configurations.
Model FLOPs for GPT3 family - accepts either AutoConfig or normalized config
Model FLOPs for GPT-OSS
Calculate the FLOPs for the GPT-OSS model.
Model FLOPs for llama2 family - accepts either AutoConfig or normalized config
Model FLOPs for llama3 family - accepts either AutoConfig or normalized config
Calculate the FLOPs for the loss.
Model FLOPs for MiniMax-M2 family - accepts either AutoConfig or normalized config.
Architecture: GQA attention (Q/K/V/O separate projections, head_dim may differ from hidden_size // num_heads) + MoE with SwiGLU (no shared experts by default). Optionally includes MTP (Multi-Token Prediction) modules gated by use_mtp.
Model FLOPs for mixtral family - accepts either AutoConfig or normalized config
Model FLOPs for MLA + MoE models (Kimi K2, GLM-5, Mistral Small 4, etc.).
Handles VL wrappers by extracting text_config if present.
Calculate the FLOPs for the MLP.
Model FLOPs for nemotron family - accepts either AutoConfig or normalized config
Model FLOPs for NemotronH
Model FLOPs for NeVA Projection
Model FLOPs for Qwen3.5 family (MoE and Dense) with hybrid GDN/full attention.
Qwen3.5 uses a hybrid attention pattern: 75% GDN (linear attention) layers and 25% standard GQA (full attention) layers (full_attention_interval=4). Supports both the MoE variant (Qwen3.5-35B-A3B) and Dense variant (Qwen3.5-27B).
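The 75% / 25% split follows directly from full_attention_interval=4, as the sketch below illustrates. Placing the full-attention layer last in each group of four, and the label strings used, are assumptions of this sketch; the function name is hypothetical.

```python
def hybrid_layer_types(num_layers, full_attention_interval=4):
    """Hypothetical helper: per-layer attention type for a hybrid stack.

    Every full_attention_interval-th layer uses full (GQA) attention and the
    rest use GDN linear attention. Position within the group is an assumption.
    """
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layers = hybrid_layer_types(8)
full_fraction = layers.count("full_attention") / len(layers)
```

A FLOPs estimate for the whole model can then sum a GDN per-layer cost over the linear-attention entries and a GQA per-layer cost over the full-attention entries.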
Model FLOPs for Qwen3 family - accepts either AutoConfig or normalized config
Model FLOPs for Step3.5-Flash (GQA + sliding-window / full attention + MoE).
Architecture: hybrid full/SWA attention with different head counts per type, MoE with shared expert on most layers, first few layers dense, SwiGLU.
Calculate FLOPs for a standard Transformer model - accepts either AutoConfig or normalized config. Note: This does not cover encoder-decoder models.