core.optimizer.cpu_offloading.hybrid_optimizer#
Module Contents#
Classes#
HybridDeviceOptimizer: HybridDeviceOptimizer is a custom optimizer designed to facilitate hybrid parameter updates across GPU and CPU. This optimizer allows users to adjust the fraction of parameters updated on the CPU and GPU through the offload_fraction parameter.
Functions#
API#
- core.optimizer.cpu_offloading.hybrid_optimizer._param_generator(cpu_optimizer)#
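No docstring is attached to this helper. The following is a minimal sketch of what a parameter generator over a CPU sub-optimizer might look like; the iteration details are an assumption, not the documented implementation:

    import torch

    def _param_generator(cpu_optimizer: torch.optim.Optimizer):
        # Yield every parameter owned by the given CPU sub-optimizer,
        # walking its param_groups in order.
        for group in cpu_optimizer.param_groups:
            for param in group["params"]:
                yield param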
- class core.optimizer.cpu_offloading.hybrid_optimizer.HybridDeviceOptimizer(params, offload_fraction=0.5, cpu_optimizer_cls=None, gpu_optimizer_cls=None, param_update_in_fp32: bool = False, pin_cpu_grads: bool = True, pin_cpu_params: bool = True, overlap_cpu_optimizer_d2h_h2d: bool = True, **kwargs)#
Bases: torch.optim.Optimizer

HybridDeviceOptimizer is a custom optimizer designed to facilitate hybrid parameter updates across GPU and CPU. This optimizer allows users to adjust the fraction of parameters updated on the CPU and GPU through the offload_fraction parameter. It supports bf16 mixed-precision training. Additionally, the optimizer implements overlapping operations for improved performance, including gradient transfer from device to host (D2H) and parameter transfer from host to device (H2D).
Example
    from transformer_engine.pytorch.optimizers import FusedAdam as GPUAdam
    from torch.optim import AdamW as CPUAdam

    optimizer = HybridDeviceOptimizer(
        param_groups,
        cpu_optimizer_cls=CPUAdam,
        gpu_optimizer_cls=GPUAdam,
        offload_fraction=0.5,
        param_update_in_fp32=True,
        overlap_cpu_optimizer_d2h_h2d=True,
    )
    optimizer.step()
Note: This optimizer is particularly useful in scenarios where memory constraints are present or when leveraging both CPU and GPU resources can lead to performance improvements.
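The example above assumes GPU param_groups are already built. Below is a more end-to-end sketch with plain torch optimizers on both sides so it has no extra dependencies; the import path is taken from the module name on this page (the top-level package prefix may differ), and passing a plain parameter list is assumed to be accepted just like in a standard torch optimizer.

    import torch
    from torch.optim import AdamW

    # Import path assumed from the module name documented on this page.
    from core.optimizer.cpu_offloading.hybrid_optimizer import HybridDeviceOptimizer

    # Toy CUDA model; this optimizer targets GPU training with CPU offload.
    model = torch.nn.Linear(32, 32).cuda()

    optimizer = HybridDeviceOptimizer(
        list(model.parameters()),   # assumes a plain parameter list is accepted
        cpu_optimizer_cls=AdamW,    # CPU-side sub-optimizer
        gpu_optimizer_cls=AdamW,    # the example above uses FusedAdam here
        offload_fraction=0.5,       # update half of the parameters on the CPU
    )

    for _ in range(3):  # a few toy training steps
        optimizer.zero_grad(set_to_none=True)
        loss = model(torch.randn(8, 32, device="cuda")).sum()
        loss.backward()
        optimizer.step()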
Initialization
- _set_sub_optimizer_grads()#
- _register_param_copy_back_gpu_hook()#
- step(closure=None)#
Override the step method to perform the following operations:
1. Sync the HDO param_groups to the sub-optimizers.
2. Sync the grads from GPU to CPU.
3. Step the sub-optimizers.
4. Sync the sub-optimizers' state to the HDO.
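A self-contained toy sketch of those four phases; the stand-in tensors and sub-optimizers below are illustrative, not the actual implementation:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # One parameter updated directly on the GPU, and one whose update is
    # offloaded to a CPU mirror.
    gpu_param = torch.nn.Parameter(torch.randn(4, device=device))
    offloaded_param = torch.nn.Parameter(torch.randn(4, device=device))
    cpu_mirror = torch.nn.Parameter(offloaded_param.detach().to("cpu"))

    gpu_opt = torch.optim.AdamW([gpu_param], lr=1e-3)   # GPU-side sub-optimizer
    cpu_opt = torch.optim.AdamW([cpu_mirror], lr=1e-3)  # CPU-side sub-optimizer

    # Pretend a backward pass produced gradients.
    gpu_param.grad = torch.randn_like(gpu_param)
    offloaded_param.grad = torch.randn_like(offloaded_param)

    # (1) param_group sync is elided here.  (2) Copy grads D2H for offloaded params:
    cpu_mirror.grad = offloaded_param.grad.to("cpu")
    # (3) Step both sub-optimizers:
    gpu_opt.step()
    cpu_opt.step()
    # (4) Copy updated CPU mirrors back H2D; state sync to the wrapper is elided.
    offloaded_param.data.copy_(cpu_mirror.data)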
- _init_sub_optimizers()#
- static build_cpu_optimizer_list(cpu_optimizer_cls, cpu_param_groups)#
Build several CPU optimizers to enable overlap. Currently, each parameter is naively assigned to an individual optimizer (a sketch follows the parameter list below).
- Parameters:
cpu_optimizer_cls (Type[torch.optim.Optimizer]) – A torch optimizer class
cpu_param_groups (List[Dict[str, Any]]) – The CPU parameter groups
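A minimal sketch of that naive per-parameter assignment, under a hypothetical name to avoid confusion with the real static method; the only behavior taken from the description above is "one optimizer per parameter":

    import torch
    from typing import Any, Dict, List, Type

    def build_cpu_optimizer_list_sketch(
        cpu_optimizer_cls: Type[torch.optim.Optimizer],
        cpu_param_groups: List[Dict[str, Any]],
    ) -> List[torch.optim.Optimizer]:
        # One optimizer instance per parameter, so each instance's D2H/H2D
        # transfers and step() can be overlapped independently.
        optimizers = []
        for group in cpu_param_groups:
            options = {k: v for k, v in group.items() if k != "params"}
            for param in group["params"]:
                optimizers.append(cpu_optimizer_cls([{"params": [param], **options}]))
        return optimizers

    # Usage: three parameters yield three single-parameter AdamW instances.
    params = [torch.nn.Parameter(torch.randn(8)) for _ in range(3)]
    opts = build_cpu_optimizer_list_sketch(torch.optim.AdamW, [{"params": params, "lr": 1e-3}])
    assert len(opts) == 3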
- _get_sub_optimizer_param_groups(offload_fraction: float)#
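This method is undocumented. Below is a plausible sketch of splitting parameter groups by offload_fraction, assuming the fraction is measured over the total number of parameter elements and that parameters are routed to the CPU side in order; both are assumptions, not the documented behavior:

    import torch
    from typing import Any, Dict, List, Tuple

    def split_param_groups_sketch(
        param_groups: List[Dict[str, Any]], offload_fraction: float
    ) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
        # Route parameters to the CPU side until the requested fraction of the
        # total element count is offloaded; the rest stays on the GPU side.
        total = sum(p.numel() for g in param_groups for p in g["params"])
        budget = total * offload_fraction
        offloaded = 0
        cpu_groups, gpu_groups = [], []
        for group in param_groups:
            options = {k: v for k, v in group.items() if k != "params"}
            cpu_params, gpu_params = [], []
            for p in group["params"]:
                if offloaded < budget:
                    cpu_params.append(p)
                    offloaded += p.numel()
                else:
                    gpu_params.append(p)
            if cpu_params:
                cpu_groups.append({"params": cpu_params, **options})
            if gpu_params:
                gpu_groups.append({"params": gpu_params, **options})
        return cpu_groups, gpu_groups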
- _sync_sub_optimizers_state_to_hdo()#
Sync the sub-optimizers' state into the HDO's state attribute.
- _sync_hdo_state_to_sub_optimizers()#
- _sync_hdo_param_groups_to_sub_optimizers()#
Sync the HDO's new param_groups attributes (e.g. lr, weight decay) to the sub-optimizers.
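A toy sketch of that kind of hyperparameter propagation; the real method keeps a mapping from HDO groups to sub-optimizer groups, which is elided here, and the function name is hypothetical:

    import torch
    from typing import Any, Dict

    def sync_group_options_sketch(src_group: Dict[str, Any],
                                  dst_optimizer: torch.optim.Optimizer) -> None:
        # Copy every hyperparameter (lr, weight_decay, ...) except the
        # parameter list itself into the sub-optimizer's groups.
        for dst_group in dst_optimizer.param_groups:
            for key, value in src_group.items():
                if key != "params":
                    dst_group[key] = value

    # Example: a learning-rate change made on the wrapper side propagates down.
    p = torch.nn.Parameter(torch.randn(4))
    sub = torch.optim.SGD([p], lr=0.1)
    sync_group_options_sketch({"params": [p], "lr": 0.01}, sub)
    assert sub.param_groups[0]["lr"] == 0.01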
- _move_new_state_to_right_device()#
- _update_fp32_params_by_new_state()#
- update_fp32_param_by_new_param()#
Update the fp32 parameters from the new parameters.
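A toy illustration of the idea, assuming param_update_in_fp32=True keeps an fp32 copy alongside each low-precision parameter; the variable names are made up:

    import torch

    # Low-precision parameter as seen by the model, plus its fp32 copy.
    param_bf16 = torch.nn.Parameter(torch.randn(4, dtype=torch.bfloat16))
    fp32_copy = param_bf16.detach().float().clone()

    # When new parameter values arrive (e.g. after loading a checkpoint),
    # refresh the fp32 copy so subsequent steps start from the loaded values.
    param_bf16.data.copy_(torch.randn(4, dtype=torch.bfloat16))  # stand-in for "new params"
    fp32_copy.copy_(param_bf16.detach().float())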
- _register_load_state_dict_hooks()#
- zero_grad(set_to_none: bool = True)#
Zero the gradients of all parameters in the model, or set them to None when set_to_none is True.
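A small, self-contained illustration of the two modes, using a plain torch optimizer as a stand-in for the hybrid one:

    import torch

    p = torch.nn.Parameter(torch.randn(4))
    opt = torch.optim.SGD([p], lr=0.1)   # stand-in for the hybrid optimizer

    p.grad = torch.randn(4)
    opt.zero_grad(set_to_none=True)      # gradient storage released: p.grad is None
    assert p.grad is None

    p.grad = torch.randn(4)
    opt.zero_grad(set_to_none=False)     # gradient tensor kept but zero-filled
    assert torch.all(p.grad == 0)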
- dummy_step()#
A dummy step can be used to initialize optimizer.state entries that would otherwise only be created by a real step. This solves checkpoint-loading problems for in-place loading operations, such as loading a torch distributed checkpoint.
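A hedged restore sketch, reusing model, AdamW, HybridDeviceOptimizer, and optimizer from the end-to-end sketch near the top of this page; here the state dict simply comes from another optimizer instance rather than a distributed checkpoint:

    # `saved` stands in for an optimizer state dict produced by an earlier run.
    saved = optimizer.state_dict()

    fresh = HybridDeviceOptimizer(
        list(model.parameters()),
        cpu_optimizer_cls=AdamW,
        gpu_optimizer_cls=AdamW,
        offload_fraction=0.5,
    )
    fresh.dummy_step()            # materialize optimizer.state so an in-place load has tensors to fill
    fresh.load_state_dict(saved)  # load_state_dict is inherited from torch.optim.Optimizer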
- property sub_optimizers#
Return the list of sub-optimizers.