core.transformer.moe.paged_stash#
Module Contents#
Classes#
A paged stash buffer with page-level memory management. Supports both CUDA and optional pinned host buffer for overflow fallback. |
|
A paged tensor that stores data in pages within a paged stash buffer. |
|
This function is used to update the pp schedule. |
|
This function is used to update the pp schedule. |
|
Singleton manager for coordinating paged stashing across pipeline stages. Manages chunk handlers, synchronizes GPU-GPU transfers, and handles virtual pipeline parallelism |
|
Wrapper context manager that adds custom enter/exit behavior around saved_tensors_hooks. |
|
Runner for paged stash |
Functions#
Mark the start of a layer group and prepare for stash/reload. |
|
Get the paged stash context |
|
Mark the end of a layer group and prepare for stash/reload. |
|
Initialize the chunk handler, called at the start of a microbatch forward pass. |
|
Reset the chunk handler, called at the start of a training iteration. |
|
Check if paged stash overflow |
|
True if any activation was stashed to pinned host (successful spill, not overflow path). |
Data#
API#
- core.transformer.moe.paged_stash.logger#
‘getLogger(…)’
- core.transformer.moe.paged_stash.SCALE_INV_BLOCK_SIZE#
32
- class core.transformer.moe.paged_stash.PagedStashBuffer(
- num_tokens,
- hidden_size,
- page_size,
- device,
- overflow,
- host_spill,
- dtype,
- num_tokens_host=0,
A paged stash buffer with page-level memory management. Supports both CUDA and optional pinned host buffer for overflow fallback.
Buffers are organized as [num_pages, page_size, hidden_size]. Uses per-buffer free lists (circular buffer) tracked as two-element state: [0]=CUDA, [1]=host.
Initialization
- Parameters:
num_tokens – Maximum number of tokens the CUDA buffer can hold
hidden_size – Hidden dimension size
page_size – Number of tokens per page
device – Device for the buffer
overflow – Overflow flag tensor (shared across all buffers)
host_spill – Global flag set to 1 if any stash used pinned host (shared)
dtype – Data type
num_tokens_host – If > 0, allocate pinned host buffer with this many tokens for spillover.
- reset()#
Reset both CUDA and host free lists (CUDA graph safe: no new allocations).
- __repr__()#
- class core.transformer.moe.paged_stash.PagedTensor(
- tensor,
- num_tokens_tensor=None,
- avg_num_tokens: int = None,
- vp_stage=None,
- original_shape=None,
- schedule_layer_no=None,
- is_columnwise_scale_inv=None,
- max_num_tokens=None,
- hidden_size=None,
- page_size=64,
A paged tensor that stores data in pages within a paged stash buffer.
Initialization
- Parameters:
tensor – The tensor to store
num_tokens_tensor – Scalar tensor containing actual number of tokens
vp_stage – Virtual pipeline stage
layer_name – Name of the layer
max_num_tokens – Maximum number of tokens
hidden_size – Hidden size
page_size – Number of tokens per page
- property schedule_layer#
Get the schedule layer.
- offload_to_stash(
- paged_stash_buffer: core.transformer.moe.paged_stash.PagedStashBuffer,
- max_blocks=2048,
Offload the paged tensor to paged stash buffer (CUDA or host if CUDA full).
- reload_from_stash(
- paged_stash_buffer: core.transformer.moe.paged_stash.PagedStashBuffer,
- max_blocks=2048,
Reload the paged tensor from paged stash buffer (CUDA or host from spilled_to_host).
_tensormust already be allocated on the main (default) stream by the caller; this method only enqueues unpack-stream kernels that fill it from the stash.
- class core.transformer.moe.paged_stash.PipelinePreScheduleFunction#
Bases:
torch.autograd.FunctionThis function is used to update the pp schedule.
- static forward(ctx, tensor, stash_manager)#
- static backward(ctx, *grad_output)#
- class core.transformer.moe.paged_stash.PipelinePostScheduleFunction#
Bases:
torch.autograd.FunctionThis function is used to update the pp schedule.
- static forward(ctx, tensor, stash_manager)#
- static backward(ctx, *grad_output)#
- class core.transformer.moe.paged_stash.PagedStashManager#
Singleton manager for coordinating paged stashing across pipeline stages. Manages chunk handlers, synchronizes GPU-GPU transfers, and handles virtual pipeline parallelism
Initialization
Initialize the manager with queues and dedicated CUDA streams.
- STASH_MGR#
None
- classmethod get_instance()#
Get the singleton instance of PagedStashManager.
- property pack_stream#
Get the pack stream.
- property unpack_stream#
Get the unpack stream.
- set_current_layer_name(name)#
Set the current layer name.
- get_schedule_layer(vp_stage, layer_no, microbatch_no)#
Get the schedule layer.
- add_paged_tensor_to_stash(paged_tensor)#
Add a paged tensor to the stash list.
- remove_paged_tensor_from_stash()#
Remove all paged tensors from the stash list.
- stash_paged_tensors(pp_schedule_layer)#
Stash the paged tensors.
- wait_for_stash_to_complete()#
Wait for stash to complete.
- reload_paged_tensors(pp_schedule_layer, no_wait=False)#
Reload the paged tensors.
- allocate_stash_buffers(
- moe_paged_stash_buffer_size_factor_cuda: float = 1.1,
- moe_paged_stash_buffer_size_factor_cpu: float = 0.0,
Allocate stash buffers organized by [dtype][hidden_size].
- release_stash_buffers()#
Drop large stash CUDA/host page buffers after full-iteration CUDA graph teardown (fallback).
Shared
overflow/host_spillscalars are retained (small). Reallocation of page buffers happens on the nextpaged_stash_resetwhile status remainscaptured.
- update_pp_schedule(vp_stage, layer_no=None, microbatch_no=None)#
Update the pp schedule.
- update_model_chunk(vp_stage_index)#
Update layer=1, increment microbatch of new vp vp_stage.
- on_save_for_backward(tensor: torch.Tensor) Any#
Hook called when autograd saves a tensor for backward pass. Returns a tag to identify the tensor later.
- on_get_saved_tensor(saved_state: Any) torch.Tensor#
Hook called when autograd retrieves a saved tensor during backward pass. Returns the actual tensor (potentially reloading from CPU).
- class core.transformer.moe.paged_stash.PagedStashContext(stash_manager)#
Wrapper context manager that adds custom enter/exit behavior around saved_tensors_hooks.
Initialization
- __enter__()#
- __exit__(*args: Any)#
- core.transformer.moe.paged_stash.paged_stash_group_start(tensor)#
Mark the start of a layer group and prepare for stash/reload.
- core.transformer.moe.paged_stash.get_paged_stash_context(
- name=None,
- max_num_tokens=None,
- num_tokens_tensor=None,
- avg_num_tokens=None,
Get the paged stash context
- core.transformer.moe.paged_stash.paged_stash_group_commit(tensor, name=None)#
Mark the end of a layer group and prepare for stash/reload.
- core.transformer.moe.paged_stash.paged_stash_init_chunk_handler(vp_size, vp_stage)#
Initialize the chunk handler, called at the start of a microbatch forward pass.
- core.transformer.moe.paged_stash.paged_stash_reset(enabled=True, config=None)#
Reset the chunk handler, called at the start of a training iteration.
config: optional TransformerConfig; if provided, moe_paged_stash_buffer_size_factor_cuda/cpu and moe_paged_stash_page_size are read from it. Otherwise defaults to 1.10 (CUDA), 0.0 (CPU).
- core.transformer.moe.paged_stash.check_paged_stash_overflow()#
Check if paged stash overflow
- core.transformer.moe.paged_stash.check_paged_stash_host_spill()#
True if any activation was stashed to pinned host (successful spill, not overflow path).
- class core.transformer.moe.paged_stash.PagedStashRunner(
- config,
- copy_main_params,
- model,
- optimizer,
- forward_backward_func,
Runner for paged stash
Initialization
- _set_moe_paged_stash_all(value: bool) None#
Set moe_paged_stash on every tracked config (train + per VP chunk root).
- data_read(data_iterator, model, training, num_microbatches)#
Read all microbatch inputs from Dataloader and copy to static buffers.
- check_moe_overflow()#
(stash_overflow_rank_sum, overbudget_rank_sum, host_spill_rank_sum); one all_reduce.
- prepare_for_rerun(is_training=True)#
Prepare for rerun
- __call__(*args, **kwargs)#
Training-step wrapper with fallback when static-buffer paths overflow.
The first attempt runs forward/backward with a static HybridEP receive budget (moe_expert_rank_capacity_factor) and paged stashing enabled. If either the HybridEP permute buffer is over budget (tokens dropped) or the paged stash buffer overflows, this wrapper retries once in synced dropless mode: no static limit on HybridEP (capacity factor cleared, dynamic permuted size via CPU sync) and no paged stash (moe_paged_stash disabled).
At most two attempts. Each attempt prefetches microbatches, runs the schedule, then all-reduces stash overflow, HybridEP over-budget, and host spill across ranks. On success, restore capacity factor and moe_paged_stash for the next step. On overflow, prepare_for_rerun resets grads and the CUDA graph before retry.