core.optimizer.cpu_offloading.hybrid_optimizer#

Module Contents#

Classes#

HybridDeviceOptimizer

HybridDeviceOptimizer is a custom optimizer that performs hybrid parameter updates across GPU and CPU. The offload_fraction parameter controls what fraction of the parameters is updated on the CPU, with the remainder updated on the GPU.

Functions#

_param_generator

API#

core.optimizer.cpu_offloading.hybrid_optimizer._param_generator(cpu_optimizer)#
class core.optimizer.cpu_offloading.hybrid_optimizer.HybridDeviceOptimizer(
params,
offload_fraction=0.5,
cpu_optimizer_cls=None,
gpu_optimizer_cls=None,
param_update_in_fp32: bool = False,
pin_cpu_grads: bool = True,
pin_cpu_params: bool = True,
overlap_cpu_optimizer_d2h_h2d: bool = True,
**kwargs,
)#

Bases: torch.optim.Optimizer

HybridDeviceOptimizer is a custom optimizer that performs hybrid parameter updates across GPU and CPU. The offload_fraction parameter controls what fraction of the parameters is updated on the CPU, with the remainder updated on the GPU.

It supports bf16 mixed-precision training. Additionally, the optimizer implements overlapping operations for improved performance, including gradient transfer from device to host (D2H) and parameter transfer from host to device (H2D).

.. rubric:: Example

```python
from transformer_engine.pytorch.optimizers import FusedAdam as GPUAdam
from torch.optim import AdamW as CPUAdam

optimizer = HybridDeviceOptimizer(
    param_groups,
    cpu_optimizer_cls=CPUAdam,
    gpu_optimizer_cls=GPUAdam,
    offload_fraction=0.5,
    param_update_in_fp32=True,
    overlap_cpu_optimizer_d2h_h2d=True,
)
optimizer.step()
```

.. note::

This optimizer is particularly useful in scenarios where memory constraints are present or when leveraging both CPU and GPU resources can lead to performance improvements.

Initialization

_set_sub_optimizer_grads()#
_register_param_copy_back_gpu_hook()#
step(closure=None)#

Override the step method to perform the following operations (sketched in code below):

1. Sync the HDO param_groups to the sub-optimizers.
2. Sync the grads from GPU to CPU.
3. Step the sub-optimizers.
4. Sync the sub-optimizers' state back to the HDO.
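A minimal sketch of this sequence, written in terms of the private helpers listed on this page; the actual implementation may differ in detail (for example in how the closure and transfer overlap are handled):

```python
def step(self, closure=None):
    # 1. Push user-visible hyperparameter changes (lr, weight decay, ...)
    #    from self.param_groups down to the CPU and GPU sub-optimizers.
    self._sync_hdo_param_groups_to_sub_optimizers()

    # 2. Copy gradients of offloaded parameters from GPU to CPU (D2H),
    #    optionally into pinned buffers.
    self._set_sub_optimizer_grads()

    # 3. Step every sub-optimizer; the per-parameter CPU optimizers allow
    #    their D2H/H2D copies to be pipelined with the updates.
    for sub_optimizer in self.sub_optimizers:
        sub_optimizer.step()

    # 4. Pull the sub-optimizers' state back into self.state so that
    #    state_dict() reflects a single consolidated optimizer.
    self._sync_sub_optimizers_state_to_hdo()
```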

_init_sub_optimizers()#
static build_cpu_optimizer_list(cpu_optimizer_cls, cpu_param_groups)#

Build several CPU optimizers so their work can be overlapped. Currently each parameter is naively assigned to its own optimizer (see the sketch after the parameter list).

Parameters:
  • cpu_optimizer_cls (Type[torch.optim.Optimizer]) – A torch optimizer class

  • cpu_param_groups (List[Dict[str, Any]]) – The CPU parameter groups
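A hedged sketch of the naive per-parameter split described above (the function name carries a _sketch suffix to make clear it is illustrative, not the static method itself); group hyperparameters are copied into each single-parameter optimizer:

```python
from typing import Any, Dict, List, Type

import torch


def build_cpu_optimizer_list_sketch(
    cpu_optimizer_cls: Type[torch.optim.Optimizer],
    cpu_param_groups: List[Dict[str, Any]],
) -> List[torch.optim.Optimizer]:
    """Naive split: one CPU optimizer per parameter, so that each
    optimizer's grad copy / step / param copy can be overlapped."""
    cpu_optimizers = []
    for group in cpu_param_groups:
        # Preserve the group's hyperparameters but give each optimizer
        # exactly one parameter tensor.
        hyperparams = {k: v for k, v in group.items() if k != "params"}
        for param in group["params"]:
            cpu_optimizers.append(
                cpu_optimizer_cls([{"params": [param], **hyperparams}])
            )
    return cpu_optimizers
```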

_get_sub_optimizer_param_groups(offload_fraction: float)#
_sync_sub_optimizers_state_to_hdo()#

Sync the sub-optimizers' state back into the HDO's state attribute.

_sync_hdo_state_to_sub_optimizers()#
_sync_hdo_param_groups_to_sub_optimizers()#

Sync the HDO's updated param_groups attributes (e.g. lr, weight decay) to the sub-optimizers.
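For example, hyperparameters are edited on the wrapper exactly as with a plain torch.optim.Optimizer, and this sync propagates them before each step. The snippet assumes optimizer is an already constructed HybridDeviceOptimizer, as in the Example above:

```python
# Adjust hyperparameters on the wrapper, as a LR scheduler would.
for group in optimizer.param_groups:
    group["lr"] = 1e-4
    group["weight_decay"] = 0.01

# step() first syncs these param_group changes to the sub-optimizers,
# so the CPU and GPU optimizers see the updated values.
optimizer.step()
```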

_move_new_state_to_right_device()#
_update_fp32_params_by_new_state()#
update_fp32_param_by_new_param()#

Update the fp32 master parameters from the new parameter values.
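A minimal illustration of the idea behind fp32 master parameters when param_update_in_fp32=True; the tensors below are placeholders, not the class's actual attributes:

```python
import torch

# Placeholder bookkeeping: bf16 model parameters and their fp32 shadows.
params = [torch.randn(4, dtype=torch.bfloat16)]
fp32_params = [p.detach().float() for p in params]

# After the model parameters have been replaced (e.g. a checkpoint was
# loaded in place), refresh the fp32 master copies from the new values.
for p, fp32_p in zip(params, fp32_params):
    fp32_p.copy_(p.float())
```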

_register_load_state_dict_hooks()#
zero_grad(set_to_none: bool = True)#

Zero the gradients of all parameters, or set them to None when set_to_none=True.

dummy_step()#

A dummy step can be used to initialize optimizer.state before it has been populated by a real step. This resolves checkpoint-loading issues for in-place operations, such as loading a torch distributed checkpoint.
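A hedged usage sketch: param_groups and checkpoint_state_dict are assumed to exist, and the surrounding checkpointing framework determines the exact loading call.

```python
optimizer = HybridDeviceOptimizer(param_groups, offload_fraction=0.5)

# Materialize optimizer.state (e.g. Adam's exp_avg / exp_avg_sq buffers)
# so that an in-place checkpoint load has tensors to load into.
optimizer.dummy_step()

# Now the state tensors exist and can be overwritten in place, e.g. when
# loading a torch distributed checkpoint.
optimizer.load_state_dict(checkpoint_state_dict)
```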

property sub_optimizers#

Return the list of sub-optimizers.