core.optimizer#

Package Contents#

Functions#

• get_standard_config_overrides – Get standard config overrides for the optimizer, handling decoupled LR and common weight-decay skips.

• _get_param_groups – Create parameter groups for optimizer.

• _get_param_groups_and_buffers – Returns parameter groups and buffers for the optimizer.

• _get_megatron_optimizer_based_on_param_groups – Get Megatron optimizer based on parameter groups.

• check_config_overrides_consistency – Check if the config overrides are consistent with the config.

• get_megatron_optimizer – Retrieve the Megatron optimizer for model chunks.

Data#

API#

core.optimizer.logger#

‘getLogger(…)’

core.optimizer.get_standard_config_overrides(
decoupled_lr: float | None = None,
decoupled_min_lr: float | None = None,
) → Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]#

Get standard config overrides for the optimizer, handling decoupled LR and common weight-decay skips.

Parameters:
  • decoupled_lr (float | None) – decoupled learning rate.

  • decoupled_min_lr (float | None) – decoupled minimum learning rate.

Returns:

standard config overrides.

Return type:

Dict[ParamKey, ParamGroupOverride]
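
A minimal usage sketch follows; the `megatron.core.optimizer` import path (abbreviated as `core.optimizer` in this reference) is assumed, and the concrete decoupled-LR values are illustrative only.

```python
# Sketch: build the standard override dictionary for get_megatron_optimizer.
from megatron.core.optimizer import get_standard_config_overrides

# Without decoupled LRs: only the common weight-decay skips are returned.
overrides = get_standard_config_overrides()

# With a decoupled LR schedule for the affected subset of parameters
# (values here are illustrative, not recommendations).
decoupled_overrides = get_standard_config_overrides(
    decoupled_lr=1e-4,
    decoupled_min_lr=1e-5,
)
```

The resulting dictionary maps ParamKey entries to ParamGroupOverride objects and can be passed as config_overrides to get_megatron_optimizer below.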

core.optimizer._get_param_groups(
model_chunks: List[core.transformer.module.MegatronModule],
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]],
) → List[Dict]#

Create parameter groups for optimizer.

Creates parameter groups from provided optimizer config object.

NOTE: There can be more than one match between a ParamKey and a parameter. In that case, all matching ParamKey overrides are merged into a single ParamGroupOverride for that parameter, and the merged override is used as that parameter's key. Any parameters that end up with the same set of merged overrides are mapped into the same parameter group.

Parameters:
  • model_chunks (List[MegatronModule]) – model chunks to create parameter groups for.

  • config (OptimizerConfig) – optimizer configuration object.

  • config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optimizer overrides, specified on a per-layer basis. NOTE: if you want to skip applying weight decay to bias and length-1 parameters but do not want any other overrides, set this to an empty dictionary rather than the default value of None.

Returns:

List of parameter groups.
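
The None-versus-empty-dict distinction called out above can be summarized with a short sketch. This is illustrative only: _get_param_groups is a private helper, the OptimizerConfig fields used here are assumptions, and build_model() is a hypothetical stand-in for whatever builds the model chunks.

```python
from megatron.core.optimizer import _get_param_groups
from megatron.core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(lr=3e-4, weight_decay=0.1)  # assumed constructor fields
model_chunks = build_model()  # hypothetical: returns List[MegatronModule]

# config_overrides=None: no per-parameter overrides are applied at all.
groups_plain = _get_param_groups(model_chunks, config, config_overrides=None)

# config_overrides={}: no extra overrides, but weight decay is still skipped
# for bias and length-1 parameters, per the note above.
groups_skip_wd = _get_param_groups(model_chunks, config, config_overrides={})
```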

core.optimizer._get_param_groups_and_buffers(
model_chunks: List[core.transformer.module.MegatronModule],
model_chunk_offset: int,
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]],
filter_fn: Callable,
buffer_name: str,
) → Tuple[List[Dict], Dict[int, List[core.distributed.param_and_grad_buffer._ParamAndGradBuffer]]]#

Returns parameter groups and buffers for the optimizer.

Parameters:
  • model_chunks (List[MegatronModule]) – model chunks to create parameter groups for.

  • model_chunk_offset (int) – offset of model_chunks in global model_chunks list.

  • config (OptimizerConfig) – optimizer configuration object.

  • config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optimizer/scheduler overrides, specified on the basis of ParamKey matches with each parameter.


  • filter_fn (callable) – filtering function for param_groups.

  • buffer_name (str) – name of buffer.

Returns:

List of parameter groups and dictionary of model chunk IDs to buffers.
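
A sketch of a call is shown below. The reference only describes filter_fn as a "filtering function for param_groups", so a callable that takes a parameter-group dict and returns a bool is assumed here; the OptimizerConfig fields, build_model(), and the buffer name are likewise illustrative.

```python
from megatron.core.optimizer import _get_param_groups_and_buffers
from megatron.core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(lr=3e-4, weight_decay=0.1)  # assumed constructor fields
model_chunks = build_model()  # hypothetical: returns List[MegatronModule]

param_groups, buffers = _get_param_groups_and_buffers(
    model_chunks,
    model_chunk_offset=0,          # these chunks start at index 0 of the global list
    config=config,
    config_overrides={},           # standard weight-decay skips only (see _get_param_groups)
    filter_fn=lambda group: True,  # assumed signature: param-group dict -> bool
    buffer_name="buffers",         # illustrative buffer name
)
```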

core.optimizer._get_megatron_optimizer_based_on_param_groups(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
param_groups: List,
per_model_buffers: Optional[Dict[int, List[core.distributed.param_and_grad_buffer._ParamAndGradBuffer]]] = None,
model_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group_gloo: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group_idx: Optional[int] = None,
intra_dist_opt_group: Optional[torch.distributed.ProcessGroup] = None,
distributed_optimizer_instance_id: Optional[int] = 0,
) → core.optimizer.optimizer.MegatronOptimizer#

Get Megatron optimizer based on parameter groups.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (list) – list of model chunks.

  • param_groups (list) – list of parameter groups.

  • per_model_buffers (dict, optional) – buffers for distributed optimizer. Defaults to None.

  • data_parallel_group (torch.distributed.ProcessGroup, optional) – data-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group_gloo (torch.distributed.ProcessGroup, optional) – gloo data-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group_idx (int, optional) – data-parallel group index for distributed optimizer. Defaults to None.

  • distributed_optimizer_instance_id (int, optional) – ID of the distributed optimizer instance. Defaults to 0.

Returns:

Instance of MegatronOptimizer.
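
The two private helpers above are typically chained: the parameter groups and buffers produced by _get_param_groups_and_buffers feed this function. A compact sketch, with all process-group arguments left at their defaults and the same assumed config/model setup as in the earlier sketches:

```python
from megatron.core.optimizer import (
    _get_megatron_optimizer_based_on_param_groups,
    _get_param_groups_and_buffers,
)
from megatron.core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(lr=3e-4, weight_decay=0.1)  # assumed constructor fields
model_chunks = build_model()  # hypothetical: returns List[MegatronModule]

param_groups, buffers = _get_param_groups_and_buffers(
    model_chunks, model_chunk_offset=0, config=config,
    config_overrides={}, filter_fn=lambda group: True, buffer_name="buffers",
)

# Process-group arguments are omitted here; in a real setup they come from the
# caller's parallel-state bookkeeping.
optimizer = _get_megatron_optimizer_based_on_param_groups(
    config=config,
    model_chunks=model_chunks,
    param_groups=param_groups,
    per_model_buffers=buffers,
)
```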

core.optimizer.check_config_overrides_consistency(
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]],
)#

Check if the config overrides are consistent with the config.
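
A brief sketch of the intended use, validating overrides before building the optimizer; the OptimizerConfig fields are assumptions, and the reference does not specify how inconsistencies are reported:

```python
from megatron.core.optimizer import (
    check_config_overrides_consistency,
    get_standard_config_overrides,
)
from megatron.core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(lr=3e-4, weight_decay=0.1)  # assumed constructor fields
overrides = get_standard_config_overrides(decoupled_lr=1e-4)

# Flags overrides that conflict with the base config before they reach
# get_megatron_optimizer.
check_config_overrides_consistency(config, overrides)
```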

core.optimizer.get_megatron_optimizer(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]] = None,
use_gloo_process_groups: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
dump_param_to_param_group_map: Optional[str] = None,
) → core.optimizer.optimizer.MegatronOptimizer#

Retrieve the Megatron optimizer for model chunks.

We use separate optimizers for expert parameters and non-expert parameters.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (List[MegatronModule]) – model chunks to get optimizer for.

  • config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optional dictionary of parameter-group overrides that adjust the default optimizer behavior for different subsets of parameters (identified by ParamKey).

  • use_gloo_process_groups (bool) – if False, disable the use of Gloo process groups in the underlying Megatron optimizers.

  • pg_collection (Optional[ProcessGroupCollection]) – optional unified collection of process groups for distributed training.

  • dump_param_to_param_group_map (Optional[str]) – path to dump parameter to param group map.

Returns:

Instance of MegatronOptimizer.
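
Putting the public entry points together, a minimal end-to-end sketch might look like the following. The OptimizerConfig fields, build_model(), and the dump path are assumptions used only for illustration.

```python
from megatron.core.optimizer import (
    get_megatron_optimizer,
    get_standard_config_overrides,
)
from megatron.core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(lr=3e-4, min_lr=1e-5, weight_decay=0.1)  # assumed fields
model_chunks = build_model()  # hypothetical: returns List[MegatronModule]

optimizer = get_megatron_optimizer(
    config,
    model_chunks,
    config_overrides=get_standard_config_overrides(),  # standard weight-decay skips
    use_gloo_process_groups=True,
    dump_param_to_param_group_map="param_group_map.json",  # illustrative path
)
```

Per the docstring above, separate optimizers are used for expert and non-expert parameters behind the single MegatronOptimizer that is returned.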