core.optimizer.param_layout#
Parameter layout dataclasses for optimizer-driven buffer layout.
These dataclasses describe how parameters are laid out in contiguous buffers. Each distributed optimizer implementation (e.g., DistributedOptimizer) is responsible for computing these layouts via a _compute_per_buffer_param_layout method, applying its own padding, alignment, and bucket splitting rules. DDP and buffers consume the resulting layouts without any optimizer-specific knowledge.
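A hedged sketch of the handoff this paragraph describes; the call site and the allocate_buffer helper are illustrative assumptions, not the documented API:

```python
# Optimizer side: owns all padding, alignment, and bucket-splitting policy,
# and emits a FullParamLayout (exact signature is an assumption).
full_layout = optimizer._compute_per_buffer_param_layout()

# DDP side: consumes the layout mechanically, with no optimizer-specific logic.
for key, layout in full_layout.layouts.items():
    allocate_buffer(key, layout)  # allocate_buffer is a hypothetical helper
```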
Module Contents#
Classes#
- BufferKey: Identifies a distinct parameter buffer.
- PerBufferParamLayout: Layout for parameters within a single contiguous buffer.
- FullParamLayout: Layout for all parameters across all buffer groups in a model chunk.
Functions#
- pad_to_divisor: Round up value to the nearest multiple of divisor.
- pad_param_start: Align parameter start index to a 64-element boundary.
- pad_bucket_end: Pad bucket end for DP-divisibility (and optionally high NCCL bus bandwidth).
API#
- core.optimizer.param_layout.pad_to_divisor(value: int, divisor: int) -> int#
Round up value to the nearest multiple of divisor.
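A minimal sketch of the rounding arithmetic (the in-tree implementation may differ):

```python
def pad_to_divisor(value: int, divisor: int) -> int:
    # Round `value` up to the nearest multiple of `divisor`.
    return ((value + divisor - 1) // divisor) * divisor
```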
- core.optimizer.param_layout.pad_param_start(param_start_index: int) -> int#
Align parameter start index to a 64-element boundary.
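Assuming it simply reuses pad_to_divisor with a fixed divisor of 64, a plausible sketch:

```python
def pad_param_start(param_start_index: int) -> int:
    # 64-element alignment; the constant comes from the docstring above.
    return pad_to_divisor(param_start_index, 64)
```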
- core.optimizer.param_layout.pad_bucket_end(bucket_end_index: int, data_parallel_world_size: int, pad_for_high_nccl_busbw: bool) -> int#
Pad bucket end for DP-divisibility (and optionally high NCCL bus bandwidth).
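A hedged sketch building on the pad_to_divisor sketch above. DP-divisibility is required so each bucket splits into equal per-rank shards; the specific high-bus-bandwidth granularity (128 elements per rank) is illustrative, not the documented rule:

```python
def pad_bucket_end(
    bucket_end_index: int,
    data_parallel_world_size: int,
    pad_for_high_nccl_busbw: bool,
) -> int:
    # Buckets must divide evenly across data-parallel ranks so that
    # reduce-scatter/all-gather shards are equal-sized.
    divisor = data_parallel_world_size
    if pad_for_high_nccl_busbw:
        # Illustrative only: round each rank's shard up to 128 elements
        # so collective message sizes stay in NCCL's efficient regime.
        divisor = data_parallel_world_size * 128
    return pad_to_divisor(bucket_end_index, divisor)
```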
- class core.optimizer.param_layout.BufferKey#
Identifies a distinct parameter buffer.
Each unique combination of these fields corresponds to a separate contiguous buffer in DDP. Parameters are grouped into buffers by these dimensions.
.. attribute:: param_dtype
Storage dtype (torch.uint8 for FP8/NVFP4 parameters, else param.dtype).
.. attribute:: grad_dtype
Gradient reduction dtype.
.. attribute:: is_expert_parallel
Whether the buffer holds expert-parallel parameters, which use a separate data-parallel group.
- param_dtype: torch.dtype#
- grad_dtype: torch.dtype#
- is_expert_parallel: bool#
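A hedged grouping sketch: consumer code could partition parameters into one buffer per distinct key roughly like this (model, is_fp8, and is_expert are hypothetical stand-ins):

```python
from collections import defaultdict

import torch

groups = defaultdict(list)
for param in model.parameters():
    key = BufferKey(
        param_dtype=torch.uint8 if is_fp8(param) else param.dtype,  # is_fp8: hypothetical
        grad_dtype=torch.float32,
        is_expert_parallel=is_expert(param),  # is_expert: hypothetical
    )
    groups[key].append(param)  # one contiguous buffer per distinct key
```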
- class core.optimizer.param_layout.PerBufferParamLayout#
Layout for parameters within a single contiguous buffer.
Describes how parameters should be laid out in the contiguous buffer.
.. attribute:: param_index_map
Mapping from parameter to (start_index, end_index, bucket_id) in buffer.
.. attribute:: bucket_indices
List of (start_index, end_index) for each bucket.
.. attribute:: per_bucket_numel_unpadded
Number of unpadded elements per bucket.
.. attribute:: param_indices
The index of each param among same-dtype params (using the “fake” high-precision dtype for FP8/NVFP4 params). Needed for loading non-native-fp8 checkpoints in native-fp8 mode. Order matches param_index_map iteration order.
- param_index_map: Dict[torch.nn.Parameter, Tuple[int, int, int]]#
- bucket_indices: List[Tuple[int, int]]#
- per_bucket_numel_unpadded: List[int]#
- param_indices: List[int]#
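A hedged consumer-side sketch that packs parameters into a flat tensor according to a layout; buffer sizing assumes bucket_indices is ordered, and FP8/NVFP4 byte-storage handling is omitted:

```python
import torch

def pack_params(layout: PerBufferParamLayout, dtype: torch.dtype) -> torch.Tensor:
    # The buffer ends where the last (padded) bucket ends.
    total_numel = layout.bucket_indices[-1][1]
    buf = torch.empty(total_numel, dtype=dtype)
    for param, (start, end, bucket_id) in layout.param_index_map.items():
        assert end - start == param.numel()
        buf[start:end].copy_(param.detach().reshape(-1))
    return buf
```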
- class core.optimizer.param_layout.FullParamLayout#
Layout for all parameters across all buffer groups in a model chunk.
Maps BufferKey to per-buffer PerBufferParamLayout objects. Each PerBufferParamLayout has its own independent index space since different buffer groups are physically separate buffers.
.. attribute:: layouts
Mapping from BufferKey to PerBufferParamLayout.
- layouts: Dict[core.optimizer.param_layout.BufferKey, core.optimizer.param_layout.PerBufferParamLayout]#
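A short hedged sketch tying the pieces together, reusing the pack_params sketch above (full_layout is assumed in scope):

```python
for key, layout in full_layout.layouts.items():
    # Each key owns an independent index space and a separate physical buffer.
    buf = pack_params(layout, dtype=key.param_dtype)
    print(key, buf.numel(), "elements in", len(layout.bucket_indices), "buckets")
```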