bridge.training.utils.theoretical_memory_utils#
Formula-based theoretical memory estimates for model training.
The estimator logic is adapted for Megatron Bridge from the public ISEEKYAN Megatron memory estimator implementation: https://github.com/ISEEKYAN/mbridge/tree/main/memory_estimator
Module Contents#
Classes#
| MemoryComponentEstimate | Estimated memory for one per-GPU training memory component. |
| TrainingMemoryEstimate | Structured theoretical per-GPU memory estimate for Bridge training. |
Functions#
| estimate_training_memory | Estimate per-GPU training memory for a Bridge GPT-like model config. |
| format_training_memory_estimate | Format a theoretical memory estimate as a compact single-line summary. |
| compute_weight_and_optimizer_memory | Compute theoretical memory footprint for model weights and optimizer states. |
| compute_activation_memory | Compute theoretical memory footprint for activations. |
| report_theoretical_memory | Compute and print the theoretical memory footprint components. |
| _get_vocab_size | Get the potentially padded vocabulary size for the given configuration. |
Data#
API#
- bridge.training.utils.theoretical_memory_utils.NUM_BYTES_IN_MEGABYTE: int#
None
- bridge.training.utils.theoretical_memory_utils.NUM_BYTES_IN_GIGABYTE: int#
None
- class bridge.training.utils.theoretical_memory_utils.MemoryComponentEstimate#
Estimated memory for one per-GPU training memory component.
- Parameters:
name – Human-readable component name.
parameter_count – Global parameter count covered by this component.
parameter_count_per_gpu – Parameter count on the most-loaded GPU shard.
bytes_per_parameter – Per-parameter bytes for weights, gradients, and optimizer states.
memory_bytes – Estimated memory on the most-loaded GPU shard.
- name: str#
None
- parameter_count: float#
0.0
- parameter_count_per_gpu: float#
0.0
- bytes_per_parameter: float#
0.0
- memory_bytes: float#
0.0
- property memory_mb: float#
Memory in MiB.
- property memory_gb: float#
Memory in GiB.
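The fields above can be read together as a small arithmetic chain. The following is a minimal stand-in, not the Bridge class itself. It assumes that `memory_bytes` is the product of `parameter_count_per_gpu` and `bytes_per_parameter`, and that the two module constants are the binary units 1024² and 1024³; the MiB/GiB wording suggests this, but the page does not show their values. `ComponentSketch` is a hypothetical name.

```python
from dataclasses import dataclass

# Assumed values: this page lists the constants but not what they hold.
NUM_BYTES_IN_MEGABYTE = 1024 * 1024
NUM_BYTES_IN_GIGABYTE = 1024 * 1024 * 1024


@dataclass(frozen=True)
class ComponentSketch:
    """Hypothetical stand-in mirroring the documented fields."""

    name: str
    parameter_count: float = 0.0
    parameter_count_per_gpu: float = 0.0
    bytes_per_parameter: float = 0.0
    memory_bytes: float = 0.0

    @property
    def memory_mb(self) -> float:
        # Memory in MiB.
        return self.memory_bytes / NUM_BYTES_IN_MEGABYTE

    @property
    def memory_gb(self) -> float:
        # Memory in GiB.
        return self.memory_bytes / NUM_BYTES_IN_GIGABYTE


# Illustrative numbers: 2**30 parameters on the most-loaded shard, 18 bytes each.
params_per_gpu = 2**30
dense = ComponentSketch(
    name="dense",
    parameter_count=8 * params_per_gpu,
    parameter_count_per_gpu=params_per_gpu,
    bytes_per_parameter=18.0,
    # Assumption: memory is parameters-per-GPU times bytes-per-parameter.
    memory_bytes=params_per_gpu * 18.0,
)
print(f"{dense.name}: {dense.memory_gb:.1f} GiB")
```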
- class bridge.training.utils.theoretical_memory_utils.TrainingMemoryEstimate#
Structured theoretical per-GPU memory estimate for Bridge training.
- Parameters:
model_state_components – Weight, gradient, and optimizer-state components.
activation – Activation component, if activation estimation was requested.
total_parameters – Global model parameter count covered by the estimator.
assumptions – Estimator assumptions and intentionally unsupported details.
- model_state_components: tuple[bridge.training.utils.theoretical_memory_utils.MemoryComponentEstimate, ...]#
None
- activation: bridge.training.utils.theoretical_memory_utils.MemoryComponentEstimate | None#
None
- total_parameters: float#
None
- assumptions: tuple[str, ...]#
None
- property weight_and_optimizer_bytes: float#
Estimated per-GPU memory for weights, gradients, and optimizer states.
- property total_memory_bytes: float#
Estimated per-GPU training memory for all available components.
- property total_memory_mb: float#
Total estimated per-GPU memory in MiB.
- property total_memory_gb: float#
Total estimated per-GPU memory in GiB.
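One plausible reading of the aggregate properties is sketched below, under the assumption that `weight_and_optimizer_bytes` sums `memory_bytes` over `model_state_components` and that `total_memory_bytes` adds the activation component only when it is present. The helper functions and the `SimpleNamespace` stand-ins are hypothetical and are not part of the module.

```python
from types import SimpleNamespace

GIB = 1024**3


def weight_and_optimizer_bytes(model_state_components) -> float:
    # Assumed: a plain sum over the weight/gradient/optimizer components.
    return float(sum(c.memory_bytes for c in model_state_components))


def total_memory_bytes(model_state_components, activation=None) -> float:
    # Assumed: the activation component contributes only if it was estimated.
    total = weight_and_optimizer_bytes(model_state_components)
    if activation is not None:
        total += activation.memory_bytes
    return total


components = (
    SimpleNamespace(name="dense", memory_bytes=12 * GIB),
    SimpleNamespace(name="routed_experts", memory_bytes=6 * GIB),
)
activation = SimpleNamespace(name="activation", memory_bytes=4 * GIB)

with_activation = total_memory_bytes(components, activation) / GIB
without_activation = total_memory_bytes(components) / GIB
print(with_activation, without_activation)
```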
- class bridge.training.utils.theoretical_memory_utils._LayerCounts#
- dense: int#
None
- moe: int#
None
- total: int#
None
- class bridge.training.utils.theoretical_memory_utils._ParameterCounts#
- dense_transformer: float#
None
- routed_experts: float#
None
- embeddings: float#
None
- property total: float#
- bridge.training.utils.theoretical_memory_utils.estimate_training_memory(
- config: megatron.bridge.training.config.ConfigContainer,
- num_microbatches: int | None = None,
- *,
- include_activation: bool = True,
Estimate per-GPU training memory for a Bridge GPT-like model config.
The estimator is intentionally formula-based. It does not instantiate a Megatron model or import UI/debug dependencies from the external prototype linked in issue #1673. The returned structure separates dense/embedding model state, routed expert model state, and activation memory so callers can display or post-process the breakdown.
The estimator logic is adapted from the public ISEEKYAN Megatron memory estimator implementation.
- Parameters:
config – Bridge training configuration container.
num_microbatches – Number of microbatches in the pipeline schedule. Supplying this improves activation estimates when PP is enabled.
include_activation – Include the activation-memory estimate. The activation formula assumes sequence parallelism and selective recomputation, matching the legacy training-time report.
- Returns:
Structured per-GPU theoretical memory estimate.
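As a sketch of the post-processing this function is meant to enable, the helpers below read only the documented attributes (`model_state_components`, `activation`, `total_memory_bytes`, and each component's `name` and `memory_bytes`). `breakdown`, `fits_on_device`, the 0.9 headroom factor, and the `SimpleNamespace` stand-in for a real return value are all illustrative assumptions.

```python
from types import SimpleNamespace

GIB = 1024**3


def breakdown(estimate):
    """Return (name, GiB, share-of-total) rows for each available component."""
    components = list(estimate.model_state_components)
    if estimate.activation is not None:
        components.append(estimate.activation)
    total = estimate.total_memory_bytes
    return [
        (c.name, c.memory_bytes / GIB, c.memory_bytes / total if total else 0.0)
        for c in components
    ]


def fits_on_device(estimate, device_gib: float, headroom: float = 0.9) -> bool:
    """Check the estimate against a device budget, leaving some headroom."""
    return estimate.total_memory_bytes <= device_gib * GIB * headroom


# Stand-in for the structure returned by estimate_training_memory.
fake_estimate = SimpleNamespace(
    model_state_components=(
        SimpleNamespace(name="dense_and_embedding", memory_bytes=30 * GIB),
        SimpleNamespace(name="routed_experts", memory_bytes=10 * GIB),
    ),
    activation=SimpleNamespace(name="activation", memory_bytes=10 * GIB),
    total_memory_bytes=50 * GIB,
)

rows = breakdown(fake_estimate)
for name, gib, share in rows:
    print(f"{name}: {gib:.1f} GiB ({share:.0%})")
print("fits on 80 GiB device:", fits_on_device(fake_estimate, device_gib=80))
```

Because the estimate is theoretical, a headroom factor of this kind is a reasonable guard against allocator overhead and other memory the formulas do not cover.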
- bridge.training.utils.theoretical_memory_utils.format_training_memory_estimate(
- estimate: bridge.training.utils.theoretical_memory_utils.TrainingMemoryEstimate,
- *,
- unit: str = 'MB',
Format a theoretical memory estimate as a compact single-line summary.
- Parameters:
estimate – Structured estimate returned by estimate_training_memory.
unit – Either "MB" for MiB output or "GB" for GiB output.
- Returns:
Human-readable summary string.
- Raises:
ValueError – If unit is not "MB" or "GB".
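The unit handling described here can be sketched as follows. This is not the library's formatter: the output layout, the `format_estimate_sketch` name, and the `(name, bytes)` input shape are assumptions made for illustration. Only the "MB"/"GB" choice, the MiB/GiB divisors, and the `ValueError` come from the description above.

```python
def format_estimate_sketch(components, *, unit: str = "MB") -> str:
    """Format (name, bytes) pairs as one line; hypothetical layout."""
    divisors = {"MB": 1024**2, "GB": 1024**3}  # "MB" means MiB, "GB" means GiB
    if unit not in divisors:
        raise ValueError(f"unit must be 'MB' or 'GB', got {unit!r}")
    divisor = divisors[unit]
    parts = [f"{name}={size / divisor:.2f} {unit}" for name, size in components]
    total = sum(size for _, size in components)
    parts.append(f"total={total / divisor:.2f} {unit}")
    return ", ".join(parts)


line = format_estimate_sketch(
    [("weights_and_optimizer", 3 * 1024**3), ("activation", 1024**3)],
    unit="GB",
)
print(line)
```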
- bridge.training.utils.theoretical_memory_utils.compute_weight_and_optimizer_memory(
- config: megatron.bridge.training.config.ConfigContainer,
- verbose: bool = False,
Compute theoretical memory footprint for model weights and optimizer states.
Calculates the number of parameters for the model based on the configuration, determines the number of parameters on the most loaded shard considering pipeline and tensor parallelism, and estimates the memory needed based on bytes per parameter (considering precision and optimizer type).
- Parameters:
config (ConfigContainer) – The main configuration container.
verbose (bool, optional) – If True, prints detailed parameter counts. Defaults to False.
- Returns:
Estimated memory footprint in bytes for weights and optimizer states on the most loaded GPU shard.
- Return type:
float
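The bytes-per-parameter step can be illustrated with the accounting used by the legacy Megatron-LM report for mixed-precision Adam: 18 bytes per parameter, or 6 + 12 / shard size when optimizer state is sharded by a distributed optimizer. Whether this module applies exactly these constants is an assumption, and `bytes_per_parameter_sketch` and `model_state_bytes` are hypothetical names.

```python
def bytes_per_parameter_sketch(
    use_distributed_optimizer: bool, optimizer_shard_size: int
) -> float:
    """Bytes per parameter for mixed-precision Adam (legacy Megatron-LM accounting)."""
    weights_and_grads = 2 + 4    # 16-bit weight + fp32 gradient
    optimizer_state = 4 + 4 + 4  # fp32 main weight + Adam first and second moments
    if use_distributed_optimizer:
        # Optimizer state is split across the shard group.
        return weights_and_grads + optimizer_state / optimizer_shard_size
    return float(weights_and_grads + optimizer_state)


def model_state_bytes(
    params_on_most_loaded_shard: float,
    use_distributed_optimizer: bool,
    optimizer_shard_size: int,
) -> float:
    return params_on_most_loaded_shard * bytes_per_parameter_sketch(
        use_distributed_optimizer, optimizer_shard_size
    )


params = 1_000_000_000  # illustrative parameter count on the most loaded shard
replicated = model_state_bytes(params, False, 1)
sharded = model_state_bytes(params, True, 8)
print(f"replicated optimizer: {replicated / 1024**3:.2f} GiB")
print(f"sharded over 8 ranks: {sharded / 1024**3:.2f} GiB")
```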
- bridge.training.utils.theoretical_memory_utils.compute_activation_memory(
- config: megatron.bridge.training.config.ConfigContainer,
- num_microbatches: int | None,
- verbose: bool = False,
Compute theoretical memory footprint for activations.
Estimates activation memory based on the formula from the Megatron-LM paper (Table 2, https://arxiv.org/pdf/2205.05198.pdf), accounting for sequence length, batch size, hidden size, number of layers, parallelism degrees (TP, PP, virtual PP), and other model specifics.
Note: The estimate currently assumes selective activation recomputation and sequence parallelism. Calculations focus on the first pipeline stage, which typically has the highest activation memory footprint.
- Parameters:
config (ConfigContainer) – The main configuration container.
num_microbatches (int, optional) – The number of microbatches used in training.
verbose (bool, optional) – If True, prints intermediate memory calculations. Defaults to False.
- Returns:
Estimated activation memory footprint in bytes on a single GPU shard.
- Return type:
float
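For orientation, the per-layer term from Table 2 of the cited paper for tensor parallelism with sequence parallelism and selective recomputation is s · b · h · 34 / t bytes, where s is the sequence length, b the micro-batch size, h the hidden size, and t the tensor-parallel size. The sketch below evaluates only that term. Any further terms the Bridge function includes (for example for embeddings or in-flight pipeline microbatches) are not shown on this page and are omitted; the function name and the layer count are illustrative.

```python
def activation_bytes_per_layer(
    seq_length: int,
    micro_batch_size: int,
    hidden_size: int,
    tensor_parallel_size: int,
) -> float:
    """Per-layer activation bytes: s * b * h * 34 / t (TP + SP + selective recompute)."""
    return seq_length * micro_batch_size * hidden_size * 34 / tensor_parallel_size


per_layer = activation_bytes_per_layer(
    seq_length=2048, micro_batch_size=1, hidden_size=4096, tensor_parallel_size=8
)
layers_on_stage = 32  # hypothetical: all layers on one pipeline stage
per_layer_mib = per_layer / 1024**2
stage_mib = per_layer * layers_on_stage / 1024**2
print(per_layer_mib)  # 34.0
print(stage_mib)      # 1088.0
```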
- bridge.training.utils.theoretical_memory_utils.report_theoretical_memory(
- config: megatron.bridge.training.config.ConfigContainer,
- num_microbatches: int | None = None,
- verbose: bool = False,
Compute and print the theoretical memory footprint components.
Calls compute_weight_and_optimizer_memory and compute_activation_memory (if applicable based on config) and prints the results in MB.
- Parameters:
config (ConfigContainer) – The main configuration container.
num_microbatches (int, optional) – The number of microbatches. Required for accurate activation memory estimation with PP. Defaults to None.
verbose (bool, optional) – If True, passes verbosity flag to helper functions. Defaults to False.
- bridge.training.utils.theoretical_memory_utils._compute_activation_memory_bytes(
- config: megatron.bridge.training.config.ConfigContainer,
- *,
- num_microbatches: int | None,
- verbose: bool = False,
- bridge.training.utils.theoretical_memory_utils._count_parameters(
- model_config: object,
- bridge.training.utils.theoretical_memory_utils._get_layer_counts(
- model_config: object,
- bridge.training.utils.theoretical_memory_utils._with_mtp_layers(
- model_config: object,
- *,
- dense: int,
- moe: int,
- bridge.training.utils.theoretical_memory_utils._embedding_parameters_on_most_loaded_shard(
- model_config: object,
- bridge.training.utils.theoretical_memory_utils._bytes_per_parameter(
- config: megatron.bridge.training.config.ConfigContainer,
- optimizer_shard_size: int,
- bridge.training.utils.theoretical_memory_utils._expert_optimizer_shard_size(
- config: megatron.bridge.training.config.ConfigContainer,
- *,
- tensor_parallel_size: int,
- context_parallel_size: int,
- expert_parallel_size: int,
- expert_tensor_parallel_size: int,
- bridge.training.utils.theoretical_memory_utils._estimate_assumptions(
- config: megatron.bridge.training.config.ConfigContainer,
- *,
- include_activation: bool,
- has_routed_experts: bool,
- bridge.training.utils.theoretical_memory_utils._print_parameter_summary( ) None#
- bridge.training.utils.theoretical_memory_utils._ffn_projection_factor(model_config: object) float#
- bridge.training.utils.theoretical_memory_utils._ffn_activation_ratio(
- model_config: object,
- layer_counts: bridge.training.utils.theoretical_memory_utils._LayerCounts,
- bridge.training.utils.theoretical_memory_utils._has_moe(model_config: object) bool#
- bridge.training.utils.theoretical_memory_utils._positive_int_attr(config: object, name: str, default: int) int#
- bridge.training.utils.theoretical_memory_utils._optional_positive_int_attr(
- config: object,
- name: str,
- default: int,
- bridge.training.utils.theoretical_memory_utils._get_vocab_size(model_cfg) int#
Get the potentially padded vocabulary size for the given configuration.
- Parameters:
model_cfg – The model provider configuration.
- Returns:
The vocabulary size used.
- Return type:
int
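"Potentially padded" most likely refers to the Megatron convention of rounding the vocabulary up to a multiple of `make_vocab_size_divisible_by * tensor_parallel_size`, so that the embedding table splits evenly across tensor-parallel ranks. The sketch below shows that rounding. Whether `_get_vocab_size` computes it this way or reads a precomputed value from the config is an assumption, and `padded_vocab_size` is a hypothetical helper.

```python
def padded_vocab_size(
    vocab_size: int,
    make_vocab_size_divisible_by: int = 128,
    tensor_parallel_size: int = 1,
) -> int:
    """Round vocab_size up to the next multiple of divisible_by * TP size."""
    multiple = make_vocab_size_divisible_by * tensor_parallel_size
    return ((vocab_size + multiple - 1) // multiple) * multiple


gpt2_padded = padded_vocab_size(50257, make_vocab_size_divisible_by=128, tensor_parallel_size=8)
already_aligned = padded_vocab_size(51200, make_vocab_size_divisible_by=128, tensor_parallel_size=8)
print(gpt2_padded)      # 51200
print(already_aligned)  # 51200
```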