core.optimizer#

Package Contents#

Functions#

  • _matches – Returns true if passed-in parameter (with name) matches param_key.

  • _get_param_groups – Create parameter groups for optimizer.

  • _get_param_groups_and_buffers – Returns parameter groups and buffers for optimizer.

  • _get_megatron_optimizer_based_on_param_groups – Get Megatron optimizer based on parameter groups.

  • get_megatron_optimizer – Retrieve the Megatron optimizer for model chunks.

Data#

API#

core.optimizer.logger#

'getLogger(…)'

core.optimizer._matches(
param: torch.nn.Parameter,
param_name: str,
param_key: core.optimizer.optimizer_config.ParamKey,
) -> bool#

Returns true if passed-in parameter (with name) matches param_key.

Parameters:
  • param (torch.nn.Parameter) – Handle to parameter object.

  • param_name (str) – Name of parameter in underlying PyTorch module.

  • param_key (ParamKey) – key specifying the parameter(s) to match against.

Returns:

True if parameter matches passed-in param_key.

Return type:

bool
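
The following is a minimal sketch of how _matches can drive parameter selection, assuming the helper is importable from megatron.core.optimizer as listed on this page. Building a concrete ParamKey is omitted because its fields are version-specific, and select_matching_params is a hypothetical wrapper added for illustration.

```python
import torch

# Private helper; import path follows this page and may differ across versions.
from megatron.core.optimizer import _matches


def select_matching_params(module: torch.nn.Module, param_key):
    """Return (name, param) pairs in `module` that match the given ParamKey."""
    return [
        (name, param)
        for name, param in module.named_parameters()
        if _matches(param, name, param_key)
    ]
```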

core.optimizer._get_param_groups(
model_chunks: List[core.transformer.module.MegatronModule],
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, core.optimizer.optimizer_config.OptimizerConfig]],
) -> List[Dict]#

Create parameter groups for optimizer.

Creates parameter groups from provided optimizer config object.

Parameters:
  • model_chunks (List[MegatronModule]) – model chunks to create parameter groups for.

  • config (OptimizerConfig) – optimizer configuration object.

  • config_overrides (Optional[Dict[ParamKey, OptimizerConfig]]) – optimizer overrides for subsets of parameters, identified by ParamKey.

Returns:

List of parameter groups.
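
Below is a hedged sketch of calling _get_param_groups with a uniform configuration and no overrides. The OptimizerConfig field names shown (optimizer, lr, weight_decay) reflect common Megatron-LM usage and should be verified against your version; model_chunks is assumed to be a list of initialized MegatronModule instances.

```python
# Import path follows this page; verify against your Megatron-LM version.
from megatron.core.optimizer import OptimizerConfig, _get_param_groups


def build_param_groups(model_chunks):
    """Build parameter groups where every group inherits one shared config."""
    config = OptimizerConfig(optimizer="adam", lr=1e-4, weight_decay=0.01)
    # config_overrides=None: no per-parameter (ParamKey-based) overrides.
    return _get_param_groups(model_chunks, config, config_overrides=None)
```

In normal use these groups are consumed by _get_megatron_optimizer_based_on_param_groups below rather than handed directly to a raw torch.optim optimizer.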

core.optimizer._get_param_groups_and_buffers(
model_chunks: List[core.transformer.module.MegatronModule],
model_chunk_offset: int,
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, core.optimizer.optimizer_config.OptimizerConfig]],
filter_fn: Callable,
buffer_name: str,
) -> Tuple[List[Dict], Dict[int, List[core.distributed.param_and_grad_buffer._ParamAndGradBuffer]]]#

Returns parameter groups and buffers for optimizer.

Parameters:
  • model_chunks (List[MegatronModule]) – model chunks to create parameter groups for.

  • model_chunk_offset (int) – offset of model_chunks in global model_chunks list.

  • config (OptimizerConfig) – optimizer configuration object.

  • config_overrides (Optional[Dict[ParamKey, OptimizerConfig]]) – optimizer overrides for subsets of parameters, identified by ParamKey.

  • filter_fn (Callable) – filtering function applied to parameter groups (e.g., to separate expert from non-expert groups).

  • buffer_name (str) – name of buffer.

Returns:

List of parameter groups and dictionary of model chunk IDs to buffers.
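
The hedged sketch below shows how a caller might select only dense (non-expert) parameters, mirroring the separate-optimizer note under get_megatron_optimizer. The is_expert_parallel group key used by filter_fn is an assumption for illustration, as is the buffer name.

```python
from megatron.core.optimizer import _get_param_groups_and_buffers


def dense_groups_and_buffers(model_chunks, config):
    """Parameter groups and grad buffers for non-expert parameters only."""
    return _get_param_groups_and_buffers(
        model_chunks,
        model_chunk_offset=0,  # chunks start at the head of the global list
        config=config,
        config_overrides=None,
        # Assumed group key: drop groups flagged as expert-parallel.
        filter_fn=lambda group: not group.get("is_expert_parallel", False),
        buffer_name="buffers",  # assumed buffer attribute name
    )
```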

core.optimizer._get_megatron_optimizer_based_on_param_groups(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
param_groups: List,
per_model_buffers: Optional[Dict[int, List[core.distributed.param_and_grad_buffer._ParamAndGradBuffer]]] = None,
model_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group_gloo: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group_idx: Optional[int] = None,
intra_dist_opt_group: Optional[torch.distributed.ProcessGroup] = None,
distributed_optimizer_instance_id: Optional[int] = 0,
) -> core.optimizer.optimizer.MegatronOptimizer#

Get Megatron optimizer based on parameter groups.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (list) – list of model chunks.

  • param_groups (list) – list of parameter groups.

  • per_model_buffers (dict, optional) – buffers for distributed optimizer. Defaults to None.

  • model_parallel_group (torch.distributed.ProcessGroup, optional) – model-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group (torch.distributed.ProcessGroup, optional) – data-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group_gloo (torch.distributed.ProcessGroup, optional) – Gloo data-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group_idx (int, optional) – data-parallel group index for distributed optimizer. Defaults to None.

  • intra_dist_opt_group (torch.distributed.ProcessGroup, optional) – process group within a distributed-optimizer instance. Defaults to None.

  • distributed_optimizer_instance_id (int, optional) – distributed optimizer instance ID. Defaults to 0.

Returns:

Instance of MegatronOptimizer.
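
A minimal sketch of invoking this helper directly follows; in practice get_megatron_optimizer calls it for you. Leaving the process-group arguments at their None defaults assumes the library falls back to its global parallel state, which should be checked for your version.

```python
from megatron.core.optimizer import (
    _get_megatron_optimizer_based_on_param_groups,
)


def optimizer_from_groups(config, model_chunks, param_groups, buffers=None):
    """Wrap prebuilt parameter groups in a MegatronOptimizer."""
    return _get_megatron_optimizer_based_on_param_groups(
        config,
        model_chunks=model_chunks,
        param_groups=param_groups,
        per_model_buffers=buffers,  # only needed by the distributed optimizer
    )
```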

core.optimizer.get_megatron_optimizer(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, core.optimizer.optimizer_config.OptimizerConfig]] = None,
use_gloo_process_groups: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
dump_param_to_param_group_map: Optional[str] = None,
) -> core.optimizer.optimizer.MegatronOptimizer#

Retrieve the Megatron optimizer for model chunks.

We use separate optimizers for expert parameters and non-expert parameters.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (List[MegatronModule]) – model chunks to get optimizer for.

  • config_overrides (Optional[Dict[ParamKey, OptimizerConfig]]) – optional dictionary of optimizer configuration objects to override default optimizer behavior for different subsets of parameters (identified by ParamKey).

  • use_gloo_process_groups (bool) – if false, disable use of Gloo process groups in underlying Megatron optimizers.

  • pg_collection (Optional[ProcessGroupCollection]) – optional collection of process groups to use for distributed training.

  • dump_param_to_param_group_map (Optional[str]) – path to which the parameter-to-param-group map is dumped.

Returns:

Instance of MegatronOptimizer.
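
Finally, a hedged end-to-end sketch of this public entry point. Distributed initialization and model construction are omitted, and the OptimizerConfig fields shown (including use_distributed_optimizer) are common Megatron-LM settings to verify against your version.

```python
from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer


def make_optimizer(model_chunks):
    """Build one MegatronOptimizer covering expert and non-expert params."""
    config = OptimizerConfig(
        optimizer="adam",
        lr=1e-4,
        weight_decay=0.01,
        use_distributed_optimizer=True,  # shard optimizer state across ranks
    )
    return get_megatron_optimizer(config, model_chunks)
```

The returned optimizer then stands in for a raw torch.optim optimizer in the training loop, with per-parameter overrides layered on via config_overrides keyed by ParamKey when needed.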