bridge.training.utils.theoretical_memory_utils#

Computes theoretical memory footprint for model training.

Module Contents#

Functions#

compute_weight_and_optimizer_memory

Compute theoretical memory footprint for model weights and optimizer states.

compute_activation_memory

Compute theoretical memory footprint for activations.

report_theoretical_memory

Compute and print the theoretical memory footprint components.

Data#

API#

bridge.training.utils.theoretical_memory_utils.NUM_BYTES_IN_MEGABYTE: int#

Number of bytes in one megabyte; used to convert raw byte counts to MB when reporting memory footprints.

bridge.training.utils.theoretical_memory_utils.compute_weight_and_optimizer_memory(
config: megatron.bridge.training.config.ConfigContainer,
verbose: bool = False,
) -> float#

Compute theoretical memory footprint for model weights and optimizer states.

Calculates the total parameter count from the configuration, determines how many parameters land on the most heavily loaded shard under pipeline and tensor parallelism, and estimates the resulting memory from the bytes required per parameter (which depends on precision and optimizer type).

Parameters:
  • config (ConfigContainer) – The main configuration container.

  • verbose (bool, optional) – If True, prints detailed parameter counts. Defaults to False.

Returns:

Estimated memory footprint in bytes for weights and optimizer states on the most loaded GPU shard.

Return type:

float
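A minimal usage sketch (`config` is assumed to be an already-built `ConfigContainer`; its construction is omitted here):

```python
from megatron.bridge.training.utils.theoretical_memory_utils import (
    NUM_BYTES_IN_MEGABYTE,
    compute_weight_and_optimizer_memory,
)

# `config` is assumed to be a fully populated ConfigContainer describing
# the model, parallelism layout, and optimizer.
weight_and_opt_bytes = compute_weight_and_optimizer_memory(config, verbose=True)
print(f"Weights + optimizer: {weight_and_opt_bytes / NUM_BYTES_IN_MEGABYTE:.2f} MB")
```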

bridge.training.utils.theoretical_memory_utils.compute_activation_memory(
config: megatron.bridge.training.config.ConfigContainer,
num_microbatches: Optional[int],
verbose: bool = False,
) -> float#

Compute theoretical memory footprint for activations.

Estimates activation memory based on the formula from the Megatron-LM paper (Table 2, https://arxiv.org/pdf/2205.05198.pdf), accounting for sequence length, batch size, hidden size, number of layers, parallelism degrees (TP, PP, virtual PP), and other model specifics.

Note:

Currently assumes selective activation recomputation and sequence parallelism. Calculations focus on the first pipeline stage, which typically has the highest activation memory footprint.

Parameters:
  • config (ConfigContainer) – The main configuration container.

  • num_microbatches (int, optional) – The number of microbatches used in training.

  • verbose (bool, optional) – If True, prints intermediate memory calculations. Defaults to False.

Returns:

Estimated activation memory footprint in bytes on a single GPU shard.

Return type:

float
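As a rough illustration of the Table 2 estimate under the assumptions in the note above (selective activation recomputation plus sequence parallelism), the dominant per-layer term works out to roughly 34·s·b·h bytes, sharded across the tensor-parallel group. The sketch below is a simplified back-of-envelope version, not the module's exact calculation, which also folds in pipeline parallelism, virtual pipeline stages, and other model specifics:

```python
def per_layer_activation_bytes(s: int, b: int, h: int, t: int) -> float:
    """Rough per-layer activation memory in bytes, following Table 2 of
    https://arxiv.org/pdf/2205.05198.pdf under selective activation
    recomputation with sequence parallelism: ~34 * s * b * h bytes per
    transformer layer, divided across the tensor-parallel group of size t.
    Illustrative only; the module's estimate covers more detail."""
    return 34 * s * b * h / t

# e.g. seq_len=4096, micro_batch=1, hidden=4096, TP=8 -> ~68 MB per layer
print(per_layer_activation_bytes(4096, 1, 4096, 8) / (1024 * 1024))
```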

bridge.training.utils.theoretical_memory_utils.report_theoretical_memory(
config: megatron.bridge.training.config.ConfigContainer,
num_microbatches: Optional[int] = None,
verbose: bool = False,
) -> None#

Compute and print the theoretical memory footprint components.

Calls compute_weight_and_optimizer_memory and, when the configuration supports it, compute_activation_memory, then prints the results in MB.

Parameters:
  • config (ConfigContainer) – The main configuration container.

  • num_microbatches (int, optional) – The number of microbatches. Required for accurate activation memory estimation with PP. Defaults to None.

  • verbose (bool, optional) – If True, passes verbosity flag to helper functions. Defaults to False.
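
An end-to-end sketch (again assuming `config` is an already-built `ConfigContainer`):

```python
from megatron.bridge.training.utils.theoretical_memory_utils import report_theoretical_memory

# Per the parameter description above, num_microbatches is required for an
# accurate activation memory estimate when pipeline parallelism is used;
# the value 8 here is purely illustrative.
report_theoretical_memory(config, num_microbatches=8, verbose=True)
```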