bridge.training.utils.theoretical_memory_utils#

Formula-based theoretical memory estimates for model training.

The estimator logic is adapted for Megatron Bridge from the public ISEEKYAN Megatron memory estimator implementation: https://github.com/ISEEKYAN/mbridge/tree/main/memory_estimator

Module Contents#

Classes#

MemoryComponentEstimate

Estimated memory for one per-GPU training memory component.

TrainingMemoryEstimate

Structured theoretical per-GPU memory estimate for Bridge training.

_LayerCounts

_ParameterCounts

Functions#

estimate_training_memory

Estimate per-GPU training memory for a Bridge GPT-like model config.

format_training_memory_estimate

Format a theoretical memory estimate as a compact single-line summary.

compute_weight_and_optimizer_memory

Compute theoretical memory footprint for model weights and optimizer states.

compute_activation_memory

Compute theoretical memory footprint for activations.

report_theoretical_memory

Compute and print the theoretical memory footprint components.

_compute_activation_memory_bytes

_count_parameters

_get_layer_counts

_with_mtp_layers

_embedding_parameters_on_most_loaded_shard

_bytes_per_parameter

_expert_optimizer_shard_size

_estimate_assumptions

_print_parameter_summary

_ffn_projection_factor

_ffn_activation_ratio

_has_moe

_positive_int_attr

_optional_positive_int_attr

_get_vocab_size

Get the potentially padded vocabulary size for the given configuration.

Data#

API#

bridge.training.utils.theoretical_memory_utils.NUM_BYTES_IN_MEGABYTE: int#

None

bridge.training.utils.theoretical_memory_utils.NUM_BYTES_IN_GIGABYTE: int#

None

class bridge.training.utils.theoretical_memory_utils.MemoryComponentEstimate#

Estimated memory for one per-GPU training memory component.

Parameters:
  • name – Human-readable component name.

  • parameter_count – Global parameter count covered by this component.

  • parameter_count_per_gpu – Parameter count on the most-loaded GPU shard.

  • bytes_per_parameter – Per-parameter bytes for weights, gradients, and optimizer states.

  • memory_bytes – Estimated memory on the most-loaded GPU shard.

name: str#

None

parameter_count: float#

0.0

parameter_count_per_gpu: float#

0.0

bytes_per_parameter: float#

0.0

memory_bytes: float#

0.0

property memory_mb: float#

Memory in MiB.

property memory_gb: float#

Memory in GiB.
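As a rough illustration of how the byte and unit properties relate, the sketch below uses a stand-in dataclass. `ComponentSketch` and the two divisor constants are hypothetical names, not the Bridge class or its module constants. The divisors assume binary units (1 MiB = 1024**2 bytes, 1 GiB = 1024**3 bytes), which matches the "MiB"/"GiB" wording above; the real constant values are not shown in this reference.

```python
from dataclasses import dataclass

# Assumed values: the reference above does not list the real constants.
BYTES_PER_MIB = 1024 ** 2
BYTES_PER_GIB = 1024 ** 3


@dataclass(frozen=True)
class ComponentSketch:
    """Hypothetical stand-in that mirrors the documented fields."""

    name: str
    parameter_count: float = 0.0
    parameter_count_per_gpu: float = 0.0
    bytes_per_parameter: float = 0.0
    memory_bytes: float = 0.0

    @property
    def memory_mb(self) -> float:
        # Memory in MiB, derived from the byte count.
        return self.memory_bytes / BYTES_PER_MIB

    @property
    def memory_gb(self) -> float:
        # Memory in GiB, derived from the byte count.
        return self.memory_bytes / BYTES_PER_GIB


# Illustrative numbers only: 1e9 parameters per GPU at 18 bytes each.
component = ComponentSketch(
    name="dense",
    parameter_count=8e9,
    parameter_count_per_gpu=1e9,
    bytes_per_parameter=18.0,
    memory_bytes=1e9 * 18.0,
)
print(f"{component.name}: {component.memory_mb:.1f} MiB, {component.memory_gb:.2f} GiB")
```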

class bridge.training.utils.theoretical_memory_utils.TrainingMemoryEstimate#

Structured theoretical per-GPU memory estimate for Bridge training.

Parameters:
  • model_state_components – Weight, gradient, and optimizer-state components.

  • activation – Activation component, if activation estimation was requested.

  • total_parameters – Global model parameter count covered by the estimator.

  • assumptions – Estimator assumptions and intentionally unsupported details.

model_state_components: tuple[bridge.training.utils.theoretical_memory_utils.MemoryComponentEstimate, ...]#

None

activation: bridge.training.utils.theoretical_memory_utils.MemoryComponentEstimate | None#

None

total_parameters: float#

None

assumptions: tuple[str, ...]#

None

property weight_and_optimizer_bytes: float#

Estimated per-GPU memory for weights, gradients, and optimizer states.

property total_memory_bytes: float#

Estimated per-GPU training memory for all available components.

property total_memory_mb: float#

Total estimated per-GPU memory in MiB.

property total_memory_gb: float#

Total estimated per-GPU memory in GiB.
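The relationship between the aggregate properties can be sketched as follows. This is a minimal stand-in under stated assumptions: `EstimateSketch` is a hypothetical class, components are reduced to plain byte counts, and the totals are assumed to be simple sums, with activation counted only when it is present.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EstimateSketch:
    """Hypothetical stand-in; each component is reduced to its byte count."""

    model_state_bytes: Tuple[float, ...]
    activation_bytes: Optional[float] = None

    @property
    def weight_and_optimizer_bytes(self) -> float:
        # Assumed to be the sum over all model-state components.
        return float(sum(self.model_state_bytes))

    @property
    def total_memory_bytes(self) -> float:
        # Activation is optional, so it contributes only when it was estimated.
        activation = self.activation_bytes if self.activation_bytes is not None else 0.0
        return self.weight_and_optimizer_bytes + activation

    @property
    def total_memory_gb(self) -> float:
        return self.total_memory_bytes / 1024 ** 3


with_activation = EstimateSketch(model_state_bytes=(6e9, 2e9), activation_bytes=1e9)
without_activation = EstimateSketch(model_state_bytes=(6e9, 2e9))
print(with_activation.total_memory_bytes, without_activation.total_memory_bytes)
```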

class bridge.training.utils.theoretical_memory_utils._LayerCounts#

dense: int#

None

moe: int#

None

total: int#

None

class bridge.training.utils.theoretical_memory_utils._ParameterCounts#

dense_transformer: float#

None

routed_experts: float#

None

embeddings: float#

None

property total: float#

bridge.training.utils.theoretical_memory_utils.estimate_training_memory(
config: megatron.bridge.training.config.ConfigContainer,
num_microbatches: int | None = None,
*,
include_activation: bool = True,
) bridge.training.utils.theoretical_memory_utils.TrainingMemoryEstimate#

Estimate per-GPU training memory for a Bridge GPT-like model config.

The estimator is intentionally formula-based. It does not instantiate a Megatron model or import UI/debug dependencies from the external prototype linked in issue #1673. The returned structure separates dense/embedding model state, routed expert model state, and activation memory so callers can display or post-process the breakdown.

The estimator logic is adapted from the public ISEEKYAN Megatron memory estimator implementation.

Parameters:
  • config – Bridge training configuration container.

  • num_microbatches – Number of microbatches in the pipeline schedule. Supplying this improves activation estimates when PP is enabled.

  • include_activation – Include the activation-memory estimate. The activation formula assumes sequence parallelism and selective recomputation, matching the legacy training-time report.

Returns:

Structured per-GPU theoretical memory estimate.
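A possible way to call the estimator and walk the returned breakdown is sketched below. `summarize_memory` is a hypothetical helper, and the import path is an assumption inferred from the fully qualified name of `ConfigContainer` in the signature above. The import is deferred into the function body so the sketch can be defined without Megatron Bridge installed.

```python
def summarize_memory(config, num_microbatches=None):
    """Print a per-component breakdown and return a one-line summary in GiB.

    `config` is expected to be a Bridge ConfigContainer. The import path
    below is an assumption, not confirmed by this reference page.
    """
    from megatron.bridge.training.utils.theoretical_memory_utils import (
        estimate_training_memory,
        format_training_memory_estimate,
    )

    estimate = estimate_training_memory(
        config,
        num_microbatches=num_microbatches,
        include_activation=True,
    )

    # Dense/embedding and routed-expert model state arrive as separate components.
    for component in estimate.model_state_components:
        print(f"{component.name}: {component.memory_gb:.2f} GiB")

    # Activation is None when activation estimation was not requested.
    if estimate.activation is not None:
        print(f"{estimate.activation.name}: {estimate.activation.memory_gb:.2f} GiB")

    # The estimator records its assumptions alongside the numbers.
    for note in estimate.assumptions:
        print(f"assumption: {note}")

    return format_training_memory_estimate(estimate, unit="GB")
```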

bridge.training.utils.theoretical_memory_utils.format_training_memory_estimate(
estimate: bridge.training.utils.theoretical_memory_utils.TrainingMemoryEstimate,
*,
unit: str = 'MB',
) str#

Format a theoretical memory estimate as a compact single-line summary.

Parameters:
  • estimate – Structured estimate returned by estimate_training_memory.

  • unit – Either "MB" for MiB output or "GB" for GiB output.

Returns:

Human-readable summary string.

Raises:

ValueError – If unit is not "MB" or "GB".
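The unit handling described above can be sketched with a stand-in formatter. `format_bytes_summary` is hypothetical, and its output layout is an assumption; only the "MB" means MiB, "GB" means GiB, and `ValueError` behavior is taken from the description.

```python
def format_bytes_summary(parts: dict, *, unit: str = "MB") -> str:
    """Hypothetical formatter: join per-component sizes into a single line."""
    # "MB" selects MiB and "GB" selects GiB, as in the documented function.
    divisors = {"MB": 1024 ** 2, "GB": 1024 ** 3}
    if unit not in divisors:
        raise ValueError(f"unit must be 'MB' or 'GB', got {unit!r}")
    scaled = [f"{name}={size / divisors[unit]:.2f} {unit}" for name, size in parts.items()]
    return ", ".join(scaled)


summary = format_bytes_summary(
    {"model_state": 3 * 1024 ** 3, "activation": 512 * 1024 ** 2},
    unit="GB",
)
print(summary)
```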

bridge.training.utils.theoretical_memory_utils.compute_weight_and_optimizer_memory(
config: megatron.bridge.training.config.ConfigContainer,
verbose: bool = False,
) float#

Compute theoretical memory footprint for model weights and optimizer states.

Counts the model's parameters from the configuration, determines how many of them land on the most-loaded shard under pipeline and tensor parallelism, and estimates the memory from the bytes needed per parameter, which depend on the precision and the optimizer type.

Parameters:
  • config (ConfigContainer) – The main configuration container.

  • verbose (bool, optional) – If True, prints detailed parameter counts. Defaults to False.

Returns:

Estimated memory footprint in bytes for weights and optimizer states on the most loaded GPU shard.

Return type:

float
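To make the bytes-per-parameter idea concrete, here is a minimal sketch that follows the accounting commonly used for mixed-precision Adam in Megatron-LM's legacy memory report: 18 bytes per parameter without a distributed optimizer, and 6 + 12 / shard_size with one. Whether Bridge's `_bytes_per_parameter` uses exactly these numbers is an assumption, and both function names below are hypothetical.

```python
def bytes_per_parameter_sketch(use_distributed_optimizer: bool, optimizer_shard_size: int) -> float:
    """Assumed mixed-precision Adam accounting; not the Bridge implementation."""
    unsharded = 2 + 4  # 16-bit weight + fp32 gradient
    optimizer_state = 4 + 4 + 4  # fp32 master weight + Adam momentum + Adam variance
    if use_distributed_optimizer:
        # Optimizer state is split across the optimizer shard group.
        return unsharded + optimizer_state / optimizer_shard_size
    return float(unsharded + optimizer_state)


def model_state_bytes_sketch(
    params_on_most_loaded_shard: float,
    use_distributed_optimizer: bool,
    optimizer_shard_size: int,
) -> float:
    """Memory for weights, gradients, and optimizer state on one GPU."""
    per_param = bytes_per_parameter_sketch(use_distributed_optimizer, optimizer_shard_size)
    return params_on_most_loaded_shard * per_param


# Illustrative numbers: 1e9 parameters on the most-loaded shard, 8-way sharding.
replicated = model_state_bytes_sketch(1e9, False, 8)
sharded = model_state_bytes_sketch(1e9, True, 8)
print(f"replicated: {replicated / 1024 ** 3:.2f} GiB, sharded: {sharded / 1024 ** 3:.2f} GiB")
```

Under these assumptions, sharding the optimizer state across 8 ranks lowers the per-parameter cost from 18 to 7.5 bytes.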

bridge.training.utils.theoretical_memory_utils.compute_activation_memory(
config: megatron.bridge.training.config.ConfigContainer,
num_microbatches: int | None,
verbose: bool = False,
) float#

Compute theoretical memory footprint for activations.

Estimates activation memory using the formula from Table 2 of the paper "Reducing Activation Recomputation in Large Transformer Models" (https://arxiv.org/pdf/2205.05198.pdf), accounting for sequence length, batch size, hidden size, number of layers, parallelism degrees (TP, PP, virtual PP), and other model specifics.

Note: The estimate currently assumes selective activation recomputation and sequence parallelism. Calculations focus on the first pipeline stage, which typically has the highest activation memory footprint.

Parameters:
  • config (ConfigContainer) – The main configuration container.

  • num_microbatches (int | None) – The number of microbatches used in training. The argument has no default but may be None.

  • verbose (bool, optional) – If True, prints intermediate memory calculations. Defaults to False.

Returns:

Estimated activation memory footprint in bytes on a single GPU shard.

Return type:

float
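The core term of that formula can be sketched as follows. With tensor parallelism, sequence parallelism, and selective recomputation, Table 2 of the cited paper gives roughly s * b * h * 34 / t bytes per transformer layer for 16-bit activations. The function below is hypothetical and deliberately simplified: it ignores embeddings, the output layer, MoE specifics, and interleaved (virtual PP) schedules, and the microbatch scaling is an assumption based on the first stage holding at most min(PP, num_microbatches) microbatches in flight.

```python
def first_stage_activation_bytes(
    seq_length: int,
    micro_batch_size: int,
    hidden_size: int,
    num_layers: int,
    tensor_parallel_size: int,
    pipeline_parallel_size: int = 1,
    num_microbatches=None,
) -> float:
    """Simplified transformer-layer activation estimate; not the Bridge code."""
    # Per-layer term from Table 2: s * b * h * 34 / t.
    per_layer = seq_length * micro_batch_size * hidden_size * 34 / tensor_parallel_size

    # The first stage holds L / PP layers for up to PP in-flight microbatches,
    # which amounts to about L layers' worth of activations.
    total = per_layer * num_layers

    # Assumption: with fewer microbatches than stages, fewer are in flight.
    if pipeline_parallel_size > 1 and num_microbatches is not None:
        total *= min(1.0, num_microbatches / pipeline_parallel_size)
    return total


full = first_stage_activation_bytes(4096, 1, 4096, 32, tensor_parallel_size=8)
short_pipeline = first_stage_activation_bytes(
    4096, 1, 4096, 32, tensor_parallel_size=8, pipeline_parallel_size=4, num_microbatches=2
)
print(f"{full / 1024 ** 3:.3f} GiB, {short_pipeline / 1024 ** 3:.4f} GiB")
```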

bridge.training.utils.theoretical_memory_utils.report_theoretical_memory(
config: megatron.bridge.training.config.ConfigContainer,
num_microbatches: int | None = None,
verbose: bool = False,
) None#

Compute and print the theoretical memory footprint components.

Calls compute_weight_and_optimizer_memory and, if applicable based on the config, compute_activation_memory, then prints the results in MB.

Parameters:
  • config (ConfigContainer) – The main configuration container.

  • num_microbatches (int, optional) – The number of microbatches. Required for accurate activation memory estimation with PP. Defaults to None.

  • verbose (bool, optional) – If True, passes verbosity flag to helper functions. Defaults to False.

bridge.training.utils.theoretical_memory_utils._compute_activation_memory_bytes(
config: megatron.bridge.training.config.ConfigContainer,
*,
num_microbatches: int | None,
verbose: bool = False,
) float#

bridge.training.utils.theoretical_memory_utils._count_parameters(
model_config: object,
) bridge.training.utils.theoretical_memory_utils._ParameterCounts#

bridge.training.utils.theoretical_memory_utils._get_layer_counts(
model_config: object,
) bridge.training.utils.theoretical_memory_utils._LayerCounts#

bridge.training.utils.theoretical_memory_utils._with_mtp_layers(
model_config: object,
*,
dense: int,
moe: int,
) bridge.training.utils.theoretical_memory_utils._LayerCounts#

bridge.training.utils.theoretical_memory_utils._embedding_parameters_on_most_loaded_shard(
model_config: object,
) float#

bridge.training.utils.theoretical_memory_utils._bytes_per_parameter(
config: megatron.bridge.training.config.ConfigContainer,
optimizer_shard_size: int,
) float#

bridge.training.utils.theoretical_memory_utils._expert_optimizer_shard_size(
config: megatron.bridge.training.config.ConfigContainer,
*,
tensor_parallel_size: int,
context_parallel_size: int,
expert_parallel_size: int,
expert_tensor_parallel_size: int,
) int#

bridge.training.utils.theoretical_memory_utils._estimate_assumptions(
config: megatron.bridge.training.config.ConfigContainer,
*,
include_activation: bool,
has_routed_experts: bool,
) tuple[str, ...]#

bridge.training.utils.theoretical_memory_utils._print_parameter_summary(
estimate: bridge.training.utils.theoretical_memory_utils.TrainingMemoryEstimate,
) None#

bridge.training.utils.theoretical_memory_utils._ffn_projection_factor(model_config: object) float#

bridge.training.utils.theoretical_memory_utils._ffn_activation_ratio(
model_config: object,
layer_counts: bridge.training.utils.theoretical_memory_utils._LayerCounts,
) float#

bridge.training.utils.theoretical_memory_utils._has_moe(model_config: object) bool#

bridge.training.utils.theoretical_memory_utils._positive_int_attr(config: object, name: str, default: int) int#

bridge.training.utils.theoretical_memory_utils._optional_positive_int_attr(
config: object,
name: str,
default: int,
) int#

bridge.training.utils.theoretical_memory_utils._get_vocab_size(model_cfg) int#

Get the potentially padded vocabulary size for the given configuration.

Parameters:

model_cfg – The model provider configuration.

Returns:

The vocabulary size used.

Return type:

int
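What "potentially padded" can mean in practice is sketched below. The rule shown, rounding the vocabulary up to a multiple of a divisor times the tensor-parallel size, is the padding convention commonly used in Megatron-LM; whether `_get_vocab_size` applies this exact rule is an assumption, and the function and parameter names are hypothetical.

```python
def padded_vocab_size_sketch(vocab_size: int, divisible_by: int, tensor_parallel_size: int) -> int:
    """Round vocab_size up to a multiple of divisible_by * tensor_parallel_size."""
    multiple = divisible_by * tensor_parallel_size
    # Ceiling division, then scale back up to the next multiple.
    return ((vocab_size + multiple - 1) // multiple) * multiple


padded = padded_vocab_size_sketch(50257, divisible_by=128, tensor_parallel_size=8)
print(padded)
```

Padding like this keeps the embedding table evenly divisible across tensor-parallel ranks, so the padded size, not the raw tokenizer size, is the relevant one when counting embedding parameters.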