core.optimizer.optimizer#

Megatron optimizer.

Module Contents#

Classes#

MegatronOptimizer

Base class for all Megatron optimizers.

MixedPrecisionOptimizer

Base class for both the float-16 and the distributed optimizer.

Float16OptimizerWithFloat16Params

Float16 optimizer for fp16 and bf16 data types.

FP32Optimizer

Float32 optimizer.

ProxyDict

A dictionary-like object that proxies to a list of dictionaries.

ChainedOptimizer

ChainedOptimizer is designed for a collection of optimizers.

Functions#

_zero_grad_group_helper

Zero out the gradient for a group of parameters. Note: copied from torch.optim.optimizer.

_multi_tensor_copy_this_to_that

Use multi-tensor-applier to copy values from one list to another. Since there is no bfloat16 implementation of the fused copy, when overflow_buf is not provided we fall back to a simple loop copy, which is compatible with bfloat16.

Data#

API#

core.optimizer.optimizer.logger#

‘getLogger(…)’

core.optimizer.optimizer._zero_grad_group_helper(
group: List[torch.nn.Parameter],
set_to_none: bool,
use_decoupled_grad: bool = False,
)#

Zero out the gradient for a group of parameters. Note: copied from torch.optim.optimizer.
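The zeroing logic mirrors torch.optim.Optimizer.zero_grad. A minimal sketch of that pattern (the use_decoupled_grad path is omitted, and zero_grad_group is a stand-in name):

from typing import List

import torch


def zero_grad_group(group: List[torch.nn.Parameter], set_to_none: bool) -> None:
    """Zero (or drop) the .grad of every parameter in the group."""
    for param in group:
        if param.grad is None:
            continue
        if set_to_none:
            # Drop the gradient tensor entirely; the next backward pass reallocates it.
            param.grad = None
        else:
            # Keep the allocation, but detach from any graph and reset values to zero.
            if param.grad.grad_fn is not None:
                param.grad.detach_()
            else:
                param.grad.requires_grad_(False)
            param.grad.zero_()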

core.optimizer.optimizer._multi_tensor_copy_this_to_that(
this: List[torch.Tensor],
that: List[torch.Tensor],
overflow_buf: Optional[torch.Tensor] = None,
)#

Use multi-tensor-applier to copy values from one list to another. Since there is no bfloat16 implementation of the fused copy, when overflow_buf is not provided we fall back to a simple loop copy, which is compatible with bfloat16.
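A sketch of the copy described above. The fused path assumes Apex's multi_tensor_applier and amp_C extension are available; copy_this_to_that is a stand-in name:

from typing import List, Optional

import torch


def copy_this_to_that(
    this: List[torch.Tensor],
    that: List[torch.Tensor],
    overflow_buf: Optional[torch.Tensor] = None,
) -> None:
    """Copy each tensor in `this` into the corresponding tensor in `that`."""
    if overflow_buf is not None:
        # Fused path: a single multi-tensor kernel; scaling by 1.0 makes it a copy.
        import amp_C
        from apex.multi_tensor_apply import multi_tensor_applier

        overflow_buf.fill_(0)
        multi_tensor_applier(amp_C.multi_tensor_scale, overflow_buf, [this, that], 1.0)
    else:
        # Fallback loop copy; works for bfloat16, which the fused kernel does not cover.
        for this_, that_ in zip(this, that):
            that_.copy_(this_)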

core.optimizer.optimizer.param_group_identifier_keys#

('wd_mult', 'lr_mult', 'is_expert_parallel', 'is_decoupled_lr')

class core.optimizer.optimizer.MegatronOptimizer(
optimizer: torch.optim.Optimizer,
config: core.optimizer.optimizer_config.OptimizerConfig,
init_state_fn: Callable = lambda x: ...,
)#

Bases: abc.ABC

Base class for all Megatron optimizers.

Parameters:
  • optimizer (torch.optim.Optimizer) – base optimizer such as Adam or SGD.

  • config (OptimizerConfig) – configuration object for optimizer.

  • init_state_fn (Callable, optional) – function to initialize state in the optimizer.

Initialization

Input optimizer is the base optimizer (e.g., Adam).

get_parameters() List[torch.nn.Parameter]#

Get list of parameters wrapped in optimizer.

get_main_grads_for_grad_norm() List[torch.Tensor]#

Get main_grads that should be taken into account to compute the grad norm. Parameters are filtered on the criteria below (a sketch of this filter follows the list):

  • grad should not be None.

  • parameter should not be shared (i.e., grads shouldn’t be double counted while computing norms).

  • should not be a replica due to tensor model parallelism.
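A minimal sketch of the filter above. The shared/tensor-parallel predicates are passed in as callables to keep the sketch self-contained; in Megatron-Core they correspond to helpers such as param_is_not_shared and param_is_not_tensor_parallel_duplicate:

from typing import Callable, List

import torch


def main_grads_for_grad_norm(
    params: List[torch.nn.Parameter],
    is_not_shared: Callable[[torch.nn.Parameter], bool],
    is_not_tp_duplicate: Callable[[torch.nn.Parameter], bool],
) -> List[torch.Tensor]:
    """Collect the gradients that should contribute to the global grad norm."""
    grads_for_norm = []
    for param in params:
        if param.grad is None:
            continue  # no gradient to count
        if not is_not_shared(param):
            continue  # shared parameter: avoid double counting
        if not is_not_tp_duplicate(param):
            continue  # tensor-model-parallel replica: counted on one rank only
        grads_for_norm.append(param.grad)
    return grads_for_norm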

get_grad_stats_parallel_group() torch.distributed.ProcessGroup#

Process group for reducing gradient statistics (num_zeros & norm).

The two most common cases are:

  • Non-distributed optimizer (default): Return the model-parallel group.

  • Distributed optimizer (overridden in distrib_optimizer.py): Return the entire world.

abstractmethod prepare_grads() bool#

Pre-process gradients before the optimizer step; returns whether inf/nan is found.

abstractmethod step_with_ready_grads() bool#

Step the optimizer with ready gradients; returns whether the step succeeded.

get_grad_norm()#

Compute and return grad norm.

clip_grad_norm(clip_grad: float) float#

Compute and return the grad norm, and clip the grads.
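The clipping step itself follows the standard global-norm recipe. A sketch, assuming total_norm has already been computed (and reduced across the grad-stats group):

from typing import List

import torch


def clip_by_total_norm(grads: List[torch.Tensor], total_norm: float, clip_grad: float) -> None:
    """Scale gradients in place when their global norm exceeds clip_grad."""
    clip_coeff = clip_grad / (total_norm + 1.0e-6)
    if clip_coeff < 1.0:
        for grad in grads:
            grad.mul_(clip_coeff)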

count_zeros() float#

Count number of zeros in model’s gradients.

abstractmethod zero_grad(set_to_none: bool = True)#

Zero gradients and prepare for next forward pass.

abstractmethod get_loss_scale() torch.Tensor#

Get current loss scale factor. NOTE: The output should be a CUDA tensor of size 1.

scale_loss(loss: torch.Tensor) torch.Tensor#

Simple scaling.
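A typical usage pattern (optimizer and loss are placeholder names). Scaling the loss before backward keeps fp16 gradients from underflowing; the scale is undone when the gradients are unscaled in prepare_grads:

# loss is the scalar training loss produced by the forward pass (hypothetical).
scaled_loss = optimizer.scale_loss(loss)  # loss * current loss scale
scaled_loss.backward()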

abstractmethod reload_model_params(state_dict=None)#

Refreshes any internal state from the current model parameters. Call this whenever the parameters are changed outside of the optimizer. For example, when a model is loaded from a checkpoint without loading the optimizer, the model parameters are updated, but for an fp16 optimizer with main parameters, the main parameters also need to be updated.

Parameters:

state_dict (dict, optional) – When it is not None, we use the params from the input state_dict to initialize the main params, instead of using the model params for initialization. This is useful when the precision of the model params is lower than that of the params from the state dict, as it allows the main params to be more accurate.

abstractmethod state_dict()#

Return state_dict.

abstractmethod load_state_dict(state_dict)#

Load the passed-in state_dict.

_get_state()#
_set_state(value)#
state#

‘property(…)’

_get_param_groups()#
_set_param_groups(value)#
param_groups#

‘property(…)’

abstractmethod step()#

Step the optimizer.

abstractmethod sharded_state_dict(
model_sharded_state_dict: core.dist_checkpointing.mapping.ShardedStateDict,
is_loading: bool = False,
metadata: Optional[dict] = None,
) core.dist_checkpointing.mapping.ShardedStateDict#

Builds sharded state dict for the optimizer, based on model’s sharded state dict.

Parameters:
  • model_sharded_state_dict (ShardedStateDict) – sharded state dict of the model

  • is_loading (bool, optional) – flag indicating whether the state dict will be used to save or load the optimizer state. Defaults to False.

  • metadata (dict, optional) – metadata controlling the sharded_state_dict logic.

Returns: optimizer sharded state dict

static _extract_common_per_param_step(
state_dict,
) Union[int, torch.Tensor, None]#
static _restore_common_per_param_step(
state_dict: Dict,
step: Union[int, torch.Tensor],
)#
offload_to_cpu()#

Function used for RL training. Move optimizer state tensors to CPU to free GPU memory during inference.

restore_from_cpu()#

Function used for RL training. Restore optimizer state tensors from CPU back to GPU for training.
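A generic sketch of the offload/restore idea for a plain torch optimizer's state (the real methods operate on the Megatron optimizer's own state and parameter tensors):

import torch


def move_optimizer_state(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Move every tensor held in optimizer.state (e.g. Adam's exp_avg buffers) to device."""
    for per_param_state in optimizer.state.values():
        for key, value in per_param_state.items():
            if torch.is_tensor(value):
                per_param_state[key] = value.to(device, non_blocking=True)


# move_optimizer_state(base_optimizer, torch.device("cpu"))   # before inference
# move_optimizer_state(base_optimizer, torch.device("cuda"))  # before resuming training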

static _filter_and_reorder_param_groups(
current_groups: List[Dict],
state_dict_groups: List[Dict],
) List[Dict]#

Filter and reorder state_dict parameter groups to match current optimizer groups. The keys used for matching align with those from _get_param_groups: (wd_mult, lr_mult, is_expert_parallel, is_decoupled_lr). A sketch of this matching follows below.

Parameters:
  • current_groups (List[Dict]) – Parameter groups from the current optimizer instance.

  • state_dict_groups (List[Dict]) – Parameter groups loaded from a state dict.

Returns:

Filtered and reordered parameter groups matching the current optimizer.

Return type:

List[Dict]

Raises:

ValueError – If parameter groups in state dict don’t match current optimizer.
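A sketch of the matching idea: each group is keyed by its identifier tuple and the state-dict groups are reordered to follow the current optimizer's order. This illustrates the behaviour described above, not the exact implementation:

from typing import Dict, List, Tuple

MATCH_KEYS = ("wd_mult", "lr_mult", "is_expert_parallel", "is_decoupled_lr")


def group_key(group: Dict) -> Tuple:
    return tuple(group.get(key) for key in MATCH_KEYS)


def filter_and_reorder(current_groups: List[Dict], state_dict_groups: List[Dict]) -> List[Dict]:
    """Reorder state-dict param groups so they line up with the current optimizer's groups."""
    by_key = {group_key(group): group for group in state_dict_groups}
    reordered = []
    for group in current_groups:
        key = group_key(group)
        if key not in by_key:
            raise ValueError(f"no state-dict param group matches identifier {key}")
        reordered.append(by_key[key])
    return reordered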

class core.optimizer.optimizer.MixedPrecisionOptimizer(
optimizer: torch.optim.Optimizer,
config: core.optimizer.optimizer_config.OptimizerConfig,
grad_scaler: Optional[core.optimizer.grad_scaler.MegatronGradScaler],
init_state_fn: Callable,
)#

Bases: core.optimizer.optimizer.MegatronOptimizer

Base class for both the float-16 and the distributed optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – base optimizer such as Adam or SGD.

  • config (OptimizerConfig) – configuration object for optimizer.

  • grad_scaler (MegatronGradScaler) – used for scaling gradients. This can be None, which happens when bf16 = True and no loss scaling is used. Note that a constant gradient scaler can still be used with bf16 = True, and that a grad scaler is always required for bf16 = False (i.e., fp16).

  • init_state_fn (Callable, optional) – function to initialize state in the optimizer.

Initialization

Input optimizer is the base optimizer (e.g., Adam).

get_loss_scale()#
reload_model_params(state_dict=None)#
_unscale_main_grads_and_check_for_nan()#
prepare_grads() bool#

Pre-process gradients before the optimizer step; returns whether inf/nan is found.

step_with_ready_grads() bool#

Step the optimizer with ready gradients; returns whether the step succeeded.

step()#
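prepare_grads and step_with_ready_grads split the update into an unscale-and-check phase and an apply phase. A sketch of the control flow only (the real step() also handles grad clipping and grad-norm / num-zeros statistics):

def mixed_precision_step(optimizer) -> bool:
    """Skip the parameter update whenever the unscaled grads contain inf/nan."""
    found_inf = optimizer.prepare_grads()  # copy/unscale main grads, check for inf/nan
    if found_inf:
        return False  # the grad scaler will reduce the loss scale for the next iteration
    return optimizer.step_with_ready_grads()  # apply the actual parameter update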
class core.optimizer.optimizer.Float16OptimizerWithFloat16Params(
optimizer: torch.optim.Optimizer,
config: core.optimizer.optimizer_config.OptimizerConfig,
grad_scaler: core.optimizer.grad_scaler.MegatronGradScaler,
init_state_fn: Callable,
)#

Bases: core.optimizer.optimizer.MixedPrecisionOptimizer

Float16 optimizer for fp16 and bf16 data types.

Parameters:
  • optimizer (torch.optim.Optimizer) – base optimizer such as Adam or SGD.

  • config (OptimizerConfig) – configuration object for optimizer.

  • grad_scaler (MegatronGradScaler) – used for scaling gradients. This can be None, which happens when bf16 = True and no loss scaling is used. Note that a constant gradient scaler can still be used with bf16 = True, and that a grad scaler is always required for bf16 = False (i.e., fp16).

  • init_state_fn (Callable, optional) – function to initialize state in the optimizer.

Initialization

Input optimizer is the base optimizer (e.g., Adam).

zero_grad(set_to_none=True)#

We only need to zero the model-related parameters, i.e., float16_groups and fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; when set_to_none is True, the space used by these tensors can be safely deallocated at this point.
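A schematic sketch of the method body under that description, reusing the module-level _zero_grad_group_helper and the three group attributes named in the docstring:

# Inside Float16OptimizerWithFloat16Params (schematic, not the exact implementation):
def zero_grad(self, set_to_none: bool = True):
    for groups in (
        self.float16_groups,
        self.fp32_from_float16_groups,
        self.fp32_from_fp32_groups,
    ):
        for group in groups:
            _zero_grad_group_helper(group, set_to_none)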

_collect_main_grad_data_for_unscaling()#
_get_model_and_main_params_data_float16()#
_copy_model_grads_to_main_grads()#
_copy_main_params_to_model_params()#
_copy_model_params_to_main_params(state_dict=None)#
state_dict(is_loading: bool = False)#
sharded_state_dict(
model_sharded_state_dict: core.dist_checkpointing.mapping.ShardedStateDict,
is_loading: bool = False,
metadata: Optional[dict] = None,
)#
load_state_dict(state_dict)#
class core.optimizer.optimizer.FP32Optimizer(
optimizer: torch.optim.Optimizer,
config: core.optimizer.optimizer_config.OptimizerConfig,
init_state_fn: Callable,
)#

Bases: core.optimizer.optimizer.MegatronOptimizer

Float32 optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – base optimizer such as Adam or SGD.

  • config (OptimizerConfig) – configuration object for optimizer.

  • init_state_fn (Callable, optional) – function to initialize state in the optimizer.

Initialization

Input optimizer is the base optimizer (e.g., Adam).
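A hedged construction sketch wrapping a torch Adam instance for a toy model. Import paths assume the megatron.core package layout, the OptimizerConfig fields shown (lr, weight_decay, clip_grad) are assumptions, and unspecified fields keep their defaults; in practice these wrappers are normally built by get_megatron_optimizer inside an initialized model-parallel environment:

import torch
from megatron.core.optimizer import OptimizerConfig
from megatron.core.optimizer.optimizer import FP32Optimizer

model = torch.nn.Linear(16, 16)                          # stand-in model
base = torch.optim.Adam(model.parameters(), lr=1e-4)

# Field names are taken from OptimizerConfig; unspecified fields keep their defaults.
config = OptimizerConfig(lr=1e-4, weight_decay=0.01, clip_grad=0.0)
opt = FP32Optimizer(base, config, init_state_fn=lambda opt: None)

# Training-loop usage (grad-norm/statistics reductions assume an initialized
# model-parallel environment, which this standalone sketch omits):
#   loss.backward(); opt.step(); opt.zero_grad()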

zero_grad(set_to_none=True)#

Copied from torch.optim.optimizer

get_loss_scale()#

FP32 optimizer does not do any scaling.

prepare_grads() bool#

Pre-process gradients before the optimizer step; returns whether inf/nan is found.

step_with_ready_grads() bool#

Step the optimizer with ready gradients; returns whether the step succeeded.

step()#

Clip gradients (if needed) and step the base optimizer. Always return successful since there is no overflow.

reload_model_params(state_dict=None)#
state_dict()#
load_state_dict(state_dict)#
sharded_state_dict(
model_sharded_state_dict: core.dist_checkpointing.mapping.ShardedStateDict,
is_loading: bool = False,
metadata: Optional[dict] = None,
)#
class core.optimizer.optimizer.ProxyDict(inner_dicts: List[dict])#

A dictionary-like object that proxies to a list of dictionaries.

e.g., ProxyDict([{'a': 1}, {'b': 2}]) behaves like {(0, 'a'): 1, (1, 'b'): 2}. We use tuples as keys to avoid ambiguity with the keys of the inner dicts.

Initialization

__getitem__(key: Tuple[int, str])#
__setitem__(key: Tuple[int, str], value: Any)#
__len__() int#
__iter__()#
items()#

Return generator over underlying items.
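An illustrative usage of the tuple-keyed view; the expected values follow the behaves-like example in the class description:

proxy = ProxyDict([{"a": 1}, {"b": 2}])

proxy[(0, "a")]        # -> 1 (key "a" in the first inner dict)
proxy[(1, "b")]        # -> 2 (key "b" in the second inner dict)
proxy[(1, "c")] = 3    # writes into the second inner dict
len(proxy)             # total number of items across all inner dicts
for key in proxy:      # iterates over (index, key) tuples
    pass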

class core.optimizer.optimizer.ChainedOptimizer(
chained_optimizers: List[core.optimizer.optimizer.MegatronOptimizer],
)#

Bases: core.optimizer.optimizer.MegatronOptimizer

ChainedOptimizer is designed for a collection of optimizers.

These optimizers are responsible for different parts of multiple models in a training task and are executed one by one when the model is updated.

Parameters:

chained_optimizers – a list of optimizers.

Initialization

The input is a list of MegatronOptimizer instances to be chained; each wraps its own base optimizer (e.g., Adam).
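A usage sketch. dense_optimizer and moe_optimizer are hypothetical names for already-constructed MegatronOptimizer instances (e.g., one covering dense parameters and one covering expert-parallel parameters):

from megatron.core.optimizer.optimizer import ChainedOptimizer

chained = ChainedOptimizer([dense_optimizer, moe_optimizer])

chained.zero_grad()
# ... forward / backward ...
chained.step()                     # steps each chained optimizer one by one
loss_scale = chained.get_loss_scale()
state = chained.state_dict()       # collects state from every chained optimizer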

property optimizer#

Access the underlying optimizer when only one optimizer is included (kept for backward compatibility).

property param_groups: List[dict]#

Get param_groups aggregated over underlying optimizers.

property state: core.optimizer.optimizer.ProxyDict#

Return optimizer state with tuple keys, where the first element is the index of the optimizer in the list of chained optimizers.

zero_grad(set_to_none=True)#
get_loss_scale()#
_split_state_dict(state_dict)#

Split the state dict into sub-state dicts according to the chunks of each sub-optimizer in this chained optimizer.

For example, assume there are two sub-optimizers in total: the first has 1 model chunk, and the second has 7 model chunks. The state dict contains model0 ~ model7. This function splits the state dict into two sub-state dicts: the first contains model0, and the second contains model1 ~ model7 (but renamed as model0 ~ model6).
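A schematic sketch of the renumbering described in the example, assuming model-chunk entries are keyed 'model0', 'model1', ... (other entries in the real state dict are handled separately):

from typing import Dict, List


def split_model_chunks(state_dict: Dict, chunks_per_optimizer: List[int]) -> List[Dict]:
    """Split 'model<i>' entries into per-optimizer dicts, renumbering from model0."""
    splits, start = [], 0
    for num_chunks in chunks_per_optimizer:
        sub = {}
        for local_idx in range(num_chunks):
            key = f"model{start + local_idx}"
            if key in state_dict:
                sub[f"model{local_idx}"] = state_dict[key]
        splits.append(sub)
        start += num_chunks
    return splits


# split_model_chunks(sd, [1, 7]) -> the first dict holds model0, the second holds
# the original model1..model7 renamed to model0..model6.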

reload_model_params(state_dict=None)#
state_dict()#
sharded_state_dict(
model_sharded_state_dict: core.dist_checkpointing.mapping.ShardedStateDict,
is_loading: bool = False,
**kwargs,
)#
load_state_dict(state_dict)#
prepare_grads() bool#

Pre-process gradients before the optimizer step; returns whether inf/nan is found.

step_with_ready_grads() bool#

Step the optimizer with ready gradients; returns whether the step succeeded.

grads_states_parallel_group_is_shared()#

Check if all optimizers share the same gradient statistics parallel group.

get_grad_stats_parallel_group() torch.distributed.ProcessGroup#
get_grad_norm()#
count_zeros()#
step()#

ChainedOptimizer will step all optimizers one by one.

save_parameter_state(filename: str)#

Save the distributed parameter states of all optimizers to a file.

Parameters:

filename (str) – path to save parameter state to.

load_parameter_state(
filename: str,
*,
update_legacy_format: bool = False,
)#

Load the distributed parameter states of all optimizers from a file.

Parameters:

filename (str) – path to load parameter state from.

_synchronize_steps()#

Synchronize the step of all optimizers. TE FusedAdam will not accumulate “step” for empty param groups, so we need to align the step across param groups before saving and after loading.

offload_to_cpu()#

Move optimizer state to CPU to free GPU memory during inference.

restore_from_cpu()#

Restore optimizer state from CPU back to GPU for training.