`bridge.training.utils.flop_utils`#

Module Contents#

Functions#

`get_model_chunk_vp_stage`	Return the virtual-pipeline stage assigned to a model chunk, if any.
`_accumulator_to_int`	Coerce a FLOPs accumulator (`int` or scalar `Tensor`) to `int`.
`resolve_global_flops_seqlen_stats`	Resolve data-parallel-global FLOPS sequence stats from per-rank accumulators.
`_add_flops_accumulator`	Add an int or scalar tensor to a state accumulator.
`_scalar_sum_for_accumulator`	Return a scalar sum without forcing a CUDA host sync inside forward_step.
`_real_subseq_lengths`	Extract sub-sequence lengths from cu_seqlens metadata.
`accumulate_flops_metadata`	Accumulate per-microbatch FLOPS metadata onto `state`.
`vit_flops`	Calculate FLOPs for a Vision Transformer (ViT) encoder + patch merger.
`num_floating_point_operations`	Return the number of floating point operations.

Data#

_lora_seq_stats_cache

API#

bridge.training.utils.flop_utils._lora_seq_stats_cache: dict#: None

bridge.training.utils.flop_utils.get_model_chunk_vp_stage(model: torch.nn.Module) → int | None#

Return the virtual-pipeline stage assigned to a model chunk, if any.

Parameters:: model – Model chunk, possibly wrapped by mixed precision or DDP.
Returns:: The integer virtual-pipeline stage, or None for an unchunked model or a model that does not expose the stage.
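
As a rough illustration of the lookup described above, the sketch below unwraps nested wrapper objects and then reads a stage attribute. The `.module` wrapper convention, the `vp_stage` attribute name, and the helper `find_vp_stage` are assumptions made for this example; the real function may locate the stage differently.

```python
from typing import Optional


class Chunk:
    """Stand-in for a model chunk that may expose a virtual-pipeline stage."""

    def __init__(self, vp_stage: Optional[int] = None):
        self.vp_stage = vp_stage


class Wrapper:
    """Stand-in for a DDP or mixed-precision wrapper holding the real module."""

    def __init__(self, module):
        self.module = module


def find_vp_stage(model) -> Optional[int]:
    """Hypothetical helper: peel `.module` wrappers, then read `vp_stage`."""
    while hasattr(model, "module"):
        model = model.module
    # Objects without the attribute are treated like an unchunked model.
    return getattr(model, "vp_stage", None)


wrapped = Wrapper(Wrapper(Chunk(vp_stage=2)))
plain = Chunk()
print(find_vp_stage(wrapped), find_vp_stage(plain))
```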

bridge.training.utils.flop_utils._accumulator_to_int(value) → int#: Coerce a FLOPs accumulator (int or scalar Tensor) to int.

bridge.training.utils.flop_utils.resolve_global_flops_seqlen_stats( state, *, data_parallel_size: int, vp_size: int | None = None, dp_group=None, ) → tuple[int | None, int | None, int]#

Resolve data-parallel-global FLOPS sequence stats from per-rank accumulators.

Reads the three accumulators populated by the forward step (_flops_seqlen_sum = Σ padded tokens, _flops_seqlen_sq_sum = Σᵢ sᵢ² over real sub-sequences, _flops_vision_patches) and reduces them to global totals across the data-parallel group.

Under variable-length (THD packed) training the per-rank Σᵢ sᵢ² can differ across DP ranks, so a single SUM all-reduce over dp_group is used to get the exact global sum. Dense BSHD training never requests this reduce: every DP rank contributes the same fixed sequence statistics, so extrapolating local * data_parallel_size is exact and avoids an unnecessary collective.

Parameters:

state – Object carrying the _flops_* accumulators (GlobalState).
data_parallel_size – Size of the data-parallel group (used for the extrapolation fallback).
vp_size – Virtual pipeline size. Kept for call-site compatibility; VPP does not rescale these accumulators because they already represent the executed training step consumed by the full-model FLOPS formula.
dp_group – Data-parallel process group to SUM-reduce over. Must be the pure DP group (excluding CP) matching data_parallel_size — CP ranks share the same cu_seqlens and would double-count.

Returns:

(seqlen_sum, seqlen_squared_sum, num_vision_patches). The first two are None when no accumulation happened, signalling the caller to fall back to a fixed-length estimate. num_vision_patches is 0 when no vision tokens were seen.
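
The difference between the exact reduce and the extrapolation can be sketched in plain Python by simulating the data-parallel ranks as a list. The helper names below are hypothetical, and Python's `sum` stands in for the SUM all-reduce; this is not the module's implementation.

```python
from typing import List


def exact_global_sum(per_rank: List[int]) -> int:
    """What a SUM all-reduce over the DP group would return on every rank."""
    return sum(per_rank)


def extrapolated_global_sum(local: int, data_parallel_size: int) -> int:
    """Collective-free estimate: assume every rank matches the local one."""
    return local * data_parallel_size


# Dense BSHD: 4 DP ranks, each ran 2 microbatches of fixed length 4096,
# so every rank holds the same sum of squared lengths.
dense = [2 * 4096**2] * 4
dense_exact = exact_global_sum(dense)
dense_extrapolated = extrapolated_global_sum(dense[0], len(dense))

# Packed THD: the per-rank sums of squared sub-sequence lengths differ,
# so extrapolating from one rank no longer matches the true total.
packed = [100, 250, 90, 400]
packed_exact = exact_global_sum(packed)
packed_extrapolated = extrapolated_global_sum(packed[0], len(packed))

print(dense_exact == dense_extrapolated, packed_exact, packed_extrapolated)
```

In the dense case the two values agree, which is why the collective can be skipped there. In the packed case only the reduce gives the correct total.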

bridge.training.utils.flop_utils._add_flops_accumulator(state, name: str, delta) → None#: Add an int or scalar tensor to a state accumulator.

bridge.training.utils.flop_utils._scalar_sum_for_accumulator(value: torch.Tensor) → int | torch.Tensor#: Return a scalar sum without forcing a CUDA host sync inside forward_step.
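
The accumulator helpers above share one pattern: keep adding ints or scalar tensors during the step, and convert to a host `int` only once at the end. A minimal sketch of that pattern follows. `FakeScalar` is a stand-in for a 0-dim tensor, and `add_to_accumulator` and `to_int` are hypothetical names, not the module's private functions.

```python
from types import SimpleNamespace


class FakeScalar:
    """Stand-in for a 0-dim device tensor; `.item()` models the host sync."""

    def __init__(self, value: int):
        self.value = value

    def __add__(self, other):
        other_value = other.value if isinstance(other, FakeScalar) else other
        return FakeScalar(self.value + other_value)

    __radd__ = __add__  # lets `int + FakeScalar` work as well

    def item(self) -> int:
        return self.value


def add_to_accumulator(state, name: str, delta) -> None:
    """Add `delta` to `state.<name>`, treating a missing or None value as empty."""
    current = getattr(state, name, None)
    setattr(state, name, delta if current is None else current + delta)


def to_int(value) -> int:
    """Coerce an int or scalar-like accumulator to a Python int."""
    return int(value.item()) if hasattr(value, "item") else int(value)


state = SimpleNamespace()
add_to_accumulator(state, "_flops_seqlen_sq_sum", 16)             # host int
add_to_accumulator(state, "_flops_seqlen_sq_sum", FakeScalar(9))  # "device" scalar
total = to_int(state._flops_seqlen_sq_sum)  # single conversion at end of step
print(total)
```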

bridge.training.utils.flop_utils._real_subseq_lengths( cu_seqlens: torch.Tensor | None, cu_seqlens_argmin: torch.Tensor | None = None, cu_seqlens_unpadded: torch.Tensor | None = None, cu_seqlens_unpadded_argmin: torch.Tensor | None = None, ) → torch.Tensor | None#

Extract sub-sequence lengths from cu_seqlens metadata.

Prefers cu_seqlens_unpadded (true sub-sequence boundaries when pad_seq_to_mult > 1) over the padded cu_seqlens. Truncates by the corresponding *_argmin when provided. Returns None when no cu_seqlens info is available.

Runs once per micro-batch, so it must stay free of GPU→CPU syncs. Because cu_seqlens is a monotonic non-decreasing cumulative sum, the diffs are always >= 0 and are not filtered. A boolean mask such as sub_seq_lens[sub_seq_lens > 0] would produce a data-dependent output size and force a device sync every micro-batch, which was the cause of a ~7% throughput regression. Zero-length entries (padding) contribute 0 to Σᵢ sᵢ², so dropping them is unnecessary and the result is identical.
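
The claim that filtering is unnecessary can be checked with plain lists. The sketch below assumes, for illustration only, that padding shows up as a repeated final value in cu_seqlens; the helper names are hypothetical.

```python
from typing import List


def subseq_lengths(cu_seqlens: List[int]) -> List[int]:
    """Adjacent differences of a cumulative-length list."""
    return [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]


def squared_sum(lengths: List[int]) -> int:
    """Sum of squared lengths, i.e. the attention term over sub-sequences."""
    return sum(s * s for s in lengths)


# Three real sub-sequences (3, 5 and 2 tokens). The repeated final value
# models trailing padding entries, which yield zero-length diffs.
cu_seqlens = [0, 3, 8, 10, 10, 10]
lengths = subseq_lengths(cu_seqlens)

unfiltered = squared_sum(lengths)
filtered = squared_sum([s for s in lengths if s > 0])
print(lengths, unfiltered, filtered)
```

Both sums are equal, so the unfiltered version gives the same statistic without a data-dependent output size.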

bridge.training.utils.flop_utils.accumulate_flops_metadata( state, tokens: torch.Tensor | None, *, vp_stage: int | None = None, config_seq_len: int | None = None, cu_seqlens: torch.Tensor | None = None, cu_seqlens_argmin: torch.Tensor | None = None, cu_seqlens_unpadded: torch.Tensor | None = None, cu_seqlens_unpadded_argmin: torch.Tensor | None = None, num_vision_patches: int | torch.Tensor | None = None, ) → None#

Accumulate per-microbatch FLOPS metadata onto state.

Under interleaved pipeline parallelism, the forward step runs once per virtual model chunk for the same logical data microbatch. Only virtual stage 0 contributes metadata so model chunking does not multiply the full-model FLOPS estimate. None and 0 both represent the primary/only chunk.
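
The stage-0 gating can be sketched as follows. The helper `should_accumulate` and the `seqlen_sum` field are hypothetical names used only for this example.

```python
from types import SimpleNamespace
from typing import Optional


def should_accumulate(vp_stage: Optional[int]) -> bool:
    """None and 0 both denote the primary (or only) model chunk."""
    return vp_stage is None or vp_stage == 0


state = SimpleNamespace(seqlen_sum=0)
tokens_per_microbatch = 4096

# With 4 virtual chunks, the forward step runs 4 times for the same
# logical microbatch; only the stage-0 call records its tokens.
for vp_stage in (0, 1, 2, 3):
    if should_accumulate(vp_stage):
        state.seqlen_sum += tokens_per_microbatch

print(state.seqlen_sum)
```

Without the gate, the same microbatch would be counted four times and the FLOPs estimate would scale with the number of virtual chunks.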

Writes three accumulators consumed by train.py at end of step:

_flops_seqlen_sum: mbs * tokens.shape[1] (padded total tokens this microbatch contributes), or mbs * config_seq_len for dense non-packed batches whose tensors were already context-parallel sliced. Drives the linear MLP/proj/logit terms.
_flops_seqlen_sq_sum: the THD attention term Σᵢ sᵢ², computed inline from cu_seqlens (preferring cu_seqlens_unpadded). The per-pack sub-sequence lengths are reduced via _scalar_sum_for_accumulator, which keeps the result on-device (no .item()), so the per-microbatch path stays sync-free and the single host sync happens once per step in resolve_global_flops_seqlen_stats. When cu_seqlens is absent (dense / non-packed) or degenerate, the host-int BSHD fallback mbs * dense_seq_len² is accumulated instead (bit-exact with the pre-fix value). dense_seq_len is config_seq_len when provided, otherwise tokens.shape[1].
_flops_vision_patches: running total of num_vision_patches.

num_vision_patches is the precomputed number of vision patches in this microbatch (drives the ViT term). It is deliberately model-agnostic: the caller, which knows its own encoder’s layout, computes the count and passes a scalar (e.g. Qwen-VL sums grid_thw.prod(-1) over images and videos). It may be an int or a scalar Tensor; passing a device tensor avoids a host sync here.

For THD packed training (offline packed LLM SFT or VLM in-batch packing), treating the whole pack as one length-seq_len sequence over-counts attention FLOPS by a large factor: actual attention work is Σᵢ sᵢ², not (Σᵢ sᵢ)². Using cu_seqlens here closes that gap.
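
A small worked example shows the size of that gap. The sub-sequence lengths are made up for illustration.

```python
# One pack of length 4096 built from four sub-sequences.
sub_seq_lens = [512, 1024, 2048, 512]
pack_len = sum(sub_seq_lens)

# Attention work when each sub-sequence only attends within itself.
per_subseq = sum(s * s for s in sub_seq_lens)

# Attention work if the whole pack were treated as one long sequence.
whole_pack = pack_len**2

ratio = whole_pack / per_subseq
print(pack_len, per_subseq, whole_pack, round(ratio, 2))
```

For this pack, the single-sequence estimate is about 2.9 times the real quadratic term, and the factor grows as packs contain more and shorter sub-sequences.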

bridge.training.utils.flop_utils.vit_flops( cfg: megatron.bridge.training.config.ConfigContainer, batch_size: int, num_patches: int, )#

Calculate FLOPs for a Vision Transformer (ViT) encoder + patch merger.

Includes:

ViT transformer layers (bidirectional full attention, not causal)
Patch merger (spatial merge + MLP projection to LLM hidden size)

Parameters:

cfg – Configuration container. ViT hyper-parameters are read from cfg.model.vision_config (depth, hidden_size, num_heads, intermediate_size, spatial_merge_size, out_hidden_size). Passing the whole config keeps the public signature stable as the list of required ViT attributes grows.
batch_size – Batch size.
num_patches – Per-image number of vision patches (before spatial merge). Callers that track the total patch count across the batch should divide by batch_size before invoking, because ViT attention is per-image (not cross-image) and scales quadratically with the per-image patch count.

Returns:

Total training FLOPs (forward * 3 for fwd+bwd). Returns 0 when no vision_config is attached or num_patches is non-positive.
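
The note on num_patches can be illustrated with a caller-side sketch. The helper `per_image_patches` is hypothetical, and integer division (an average per image) is an assumption about how a caller might split a batch-wide total.

```python
def per_image_patches(total_patches: int, batch_size: int) -> int:
    """Hypothetical caller-side helper: average patch count per image."""
    if batch_size <= 0:
        return 0
    return total_patches // batch_size


batch_size = 4
total_patches = 4096  # patches summed over the whole batch
n = per_image_patches(total_patches, batch_size)

# Quadratic attention term when each image attends only to itself...
per_image_quadratic = batch_size * n**2
# ...versus wrongly treating the batch as one image with all patches.
cross_image_quadratic = total_patches**2

print(n, per_image_quadratic, cross_image_quadratic)
```

Passing the batch-wide total as num_patches would inflate the quadratic attention term by a factor of batch_size in this example, which is why the division happens before the call.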

bridge.training.utils.flop_utils.num_floating_point_operations( cfg: megatron.bridge.training.config.ConfigContainer, batch_size: int = 1, seqlen_sum: int | None = None, seqlen_squared_sum: int | None = None, num_vision_patches: int = 0, )#

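Return the estimated number of floating point operations for a batch, given the model configuration and optional sequence-length and vision-patch statistics.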

Parameters:

cfg – Configuration container.
batch_size – Batch size.
seqlen_sum – Sum of actual sequence lengths across the batch (batch_size * actual_seq_length). When provided, overrides cfg.model.seq_length for more accurate FLOPS estimation with dynamic-length sequences (e.g., VLM with dynamic padding).
seqlen_squared_sum – Sum of squared sequence lengths across the batch (sum_i actual_seq_length_i^2). Used for attention core FLOPS which scale quadratically with sequence length; when omitted, falls back to batch_size * effective_seq_length^2 so the result matches the legacy constant-length estimate.
num_vision_patches – Total number of vision patches in the batch (before spatial merge). Used to compute ViT encoder FLOPS.
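
The fallback rules for the two sequence statistics can be sketched as below. The helper `resolve_seq_stats` is hypothetical, and taking the effective sequence length as seqlen_sum // batch_size is an assumption; the real function may derive it differently.

```python
from typing import Optional, Tuple


def resolve_seq_stats(
    batch_size: int,
    config_seq_length: int,
    seqlen_sum: Optional[int] = None,
    seqlen_squared_sum: Optional[int] = None,
) -> Tuple[int, int]:
    """Hypothetical helper mirroring the documented fallbacks."""
    if seqlen_sum is None:
        # No measured lengths: use the configured fixed sequence length.
        seqlen_sum = batch_size * config_seq_length
    if seqlen_squared_sum is None:
        # Assumed definition of the effective per-sample length.
        effective_seq_length = seqlen_sum // batch_size
        seqlen_squared_sum = batch_size * effective_seq_length**2
    return seqlen_sum, seqlen_squared_sum


legacy = resolve_seq_stats(2, 4096)
dynamic = resolve_seq_stats(2, 4096, seqlen_sum=6000)
packed = resolve_seq_stats(2, 4096, seqlen_sum=6000, seqlen_squared_sum=10_000_000)
print(legacy, dynamic, packed)
```

With neither statistic supplied, the result reduces to the constant-length estimate. Supplying seqlen_squared_sum lets packed batches report their true quadratic term.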
