`core.transformer.moe.paged_stash`#

Module Contents#

Classes#

`PagedStashBuffer`	A paged stash buffer with page-level memory management. Supports both CUDA and optional pinned host buffer for overflow fallback.
`PagedTensor`	A paged tensor that stores data in pages within a paged stash buffer.
`PipelinePreScheduleFunction`	This function is used to update the pp schedule.
`PipelinePostScheduleFunction`	This function is used to update the pp schedule.
`PagedStashManager`	Singleton manager for coordinating paged stashing across pipeline stages. Manages chunk handlers, synchronizes GPU-GPU transfers, and handles virtual pipeline parallelism.
`PagedStashContext`	Wrapper context manager that adds custom enter/exit behavior around saved_tensors_hooks.
`PagedStashRunner`	Runner for paged stash

Functions#

`paged_stash_group_start`	Mark the start of a layer group and prepare for stash/reload.
`get_paged_stash_context`	Get the paged stash context
`paged_stash_group_commit`	Mark the end of a layer group and prepare for stash/reload.
`paged_stash_init_chunk_handler`	Initialize the chunk handler, called at the start of a microbatch forward pass.
`paged_stash_reset`	Reset the chunk handler, called at the start of a training iteration.
`check_paged_stash_overflow`	Check whether the paged stash has overflowed.
`check_paged_stash_host_spill`	True if any activation was stashed to pinned host (successful spill, not overflow path).

Data#

`logger`
`SCALE_INV_BLOCK_SIZE`

API#

core.transformer.moe.paged_stash.logger#: ‘getLogger(…)’

core.transformer.moe.paged_stash.SCALE_INV_BLOCK_SIZE#: 32

class core.transformer.moe.paged_stash.PagedStashBuffer( num_tokens, hidden_size, page_size, device, overflow, host_spill, dtype, num_tokens_host=0, )#

A paged stash buffer with page-level memory management. Supports both CUDA and optional pinned host buffer for overflow fallback.

Buffers are organized as [num_pages, page_size, hidden_size]. Uses per-buffer free lists (circular buffer) tracked as two-element state: [0]=CUDA, [1]=host.

Initialization

Parameters:

num_tokens – Maximum number of tokens the CUDA buffer can hold
hidden_size – Hidden dimension size
page_size – Number of tokens per page
device – Device for the buffer
overflow – Overflow flag tensor (shared across all buffers)
host_spill – Global flag set to 1 if any stash used pinned host (shared)
dtype – Data type
num_tokens_host – If > 0, allocate pinned host buffer with this many tokens for spillover.

reset()#: Reset both CUDA and host free lists (CUDA graph safe: no new allocations).

__repr__()#
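The page bookkeeping described for this class can be pictured with a small pure-Python sketch. It is illustrative only: `PageFreeList` and `pages_needed` are hypothetical names, the real buffer keeps its state in tensors, and rounding the token count up to whole pages is an assumption rather than documented behavior.

```python
class PageFreeList:
    """Hypothetical circular free list over page ids (illustration only)."""

    def __init__(self, num_pages: int) -> None:
        if num_pages <= 0:
            raise ValueError("num_pages must be positive")
        self.capacity = num_pages
        self.slots = list(range(num_pages))  # ring holding free page ids
        self.head = 0                        # next slot to allocate from
        self.tail = 0                        # next slot to write a freed id into
        self.free = num_pages                # number of pages currently free

    def allocate(self, n: int):
        """Take n page ids from the ring, or return None if too few are free."""
        if n > self.free:
            return None
        pages = [self.slots[(self.head + i) % self.capacity] for i in range(n)]
        self.head = (self.head + n) % self.capacity
        self.free -= n
        return pages

    def release(self, pages) -> None:
        """Return page ids to the ring at the tail position."""
        for page in pages:
            self.slots[self.tail] = page
            self.tail = (self.tail + 1) % self.capacity
        self.free += len(pages)

    def reset(self) -> None:
        """Restore the initial state in place, reusing the existing list."""
        for i in range(self.capacity):
            self.slots[i] = i
        self.head = 0
        self.tail = 0
        self.free = self.capacity


def pages_needed(num_tokens: int, page_size: int) -> int:
    """Assumed rounding: a partially filled page still occupies a whole page."""
    return -(-num_tokens // page_size)
```

In this sketch a failed `allocate` returns `None`, which is where a real implementation would fall back to the host buffer or raise the overflow flag.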

class core.transformer.moe.paged_stash.PagedTensor( tensor, num_tokens_tensor=None, avg_num_tokens: int = None, vp_stage=None, original_shape=None, schedule_layer_no=None, is_columnwise_scale_inv=None, max_num_tokens=None, hidden_size=None, page_size=64, )#

A paged tensor that stores data in pages within a paged stash buffer.

Initialization

Parameters:

tensor – The tensor to store
num_tokens_tensor – Scalar tensor containing actual number of tokens
vp_stage – Virtual pipeline stage
layer_name – Name of the layer
max_num_tokens – Maximum number of tokens
hidden_size – Hidden size
page_size – Number of tokens per page

property schedule_layer#: Get the schedule layer.

offload_to_stash( paged_stash_buffer: core.transformer.moe.paged_stash.PagedStashBuffer, max_blocks=2048, )#: Offload the paged tensor to paged stash buffer (CUDA or host if CUDA full).

reload_from_stash( paged_stash_buffer: core.transformer.moe.paged_stash.PagedStashBuffer, max_blocks=2048, )#

Reload the paged tensor from paged stash buffer (CUDA or host from spilled_to_host).

_tensor must already be allocated on the main (default) stream by the caller; this method only enqueues unpack-stream kernels that fill it from the stash.

class core.transformer.moe.paged_stash.PipelinePreScheduleFunction#

Bases: torch.autograd.Function

This function is used to update the pp schedule.

static forward(ctx, tensor, stash_manager)#

static backward(ctx, *grad_output)#

class core.transformer.moe.paged_stash.PipelinePostScheduleFunction#

Bases: torch.autograd.Function

This function is used to update the pp schedule.

static forward(ctx, tensor, stash_manager)#

static backward(ctx, *grad_output)#

class core.transformer.moe.paged_stash.PagedStashManager#

Singleton manager for coordinating paged stashing across pipeline stages. Manages chunk handlers, synchronizes GPU-GPU transfers, and handles virtual pipeline parallelism.

Initialization

Initialize the manager with queues and dedicated CUDA streams.

STASH_MGR#: None

classmethod get_instance()#: Get the singleton instance of PagedStashManager.
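A class-level attribute plus a `get_instance` classmethod is a common way to build a singleton. The sketch below shows that pattern with a hypothetical `StashManagerSketch` class. It assumes lazy creation on first access, which the reference does not confirm.

```python
class StashManagerSketch:
    """Hypothetical stand-in showing a lazily created singleton."""

    STASH_MGR = None  # holds the single shared instance once created

    def __init__(self) -> None:
        self.stash_list = []  # placeholder state for the illustration

    @classmethod
    def get_instance(cls):
        # Create the instance on first use, then always return the same one.
        if cls.STASH_MGR is None:
            cls.STASH_MGR = cls()
        return cls.STASH_MGR
```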

property pack_stream#: Get the pack stream.

property unpack_stream#: Get the unpack stream.

set_current_layer_name(name)#: Set the current layer name.

get_schedule_layer(vp_stage, layer_no, microbatch_no)#: Get the schedule layer.

add_paged_tensor_to_stash(paged_tensor)#: Add a paged tensor to the stash list.

remove_paged_tensor_from_stash()#: Remove all paged tensors from the stash list.

stash_paged_tensors(pp_schedule_layer)#: Stash the paged tensors.

wait_for_stash_to_complete()#: Wait for stash to complete.

reload_paged_tensors(pp_schedule_layer, no_wait=False)#: Reload the paged tensors.

allocate_stash_buffers( moe_paged_stash_buffer_size_factor_cuda: float = 1.1, moe_paged_stash_buffer_size_factor_cpu: float = 0.0, )#: Allocate stash buffers organized by [dtype][hidden_size].

release_stash_buffers()#

Drop large stash CUDA/host page buffers after full-iteration CUDA graph teardown (fallback).

Shared overflow / host_spill scalars are retained (small). Reallocation of page buffers happens on the next paged_stash_reset while status remains captured.

update_pp_schedule(vp_stage, layer_no=None, microbatch_no=None)#: Update the pp schedule.

update_model_chunk(vp_stage_index)#: Set the layer to 1 and increment the microbatch of the new vp_stage.

on_save_for_backward(tensor: torch.Tensor) → Any#: Hook called when autograd saves a tensor for backward pass. Returns a tag to identify the tensor later.

on_get_saved_tensor(saved_state: Any) → torch.Tensor#: Hook called when autograd retrieves a saved tensor during backward pass. Returns the actual tensor (potentially reloading from CPU).
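The two hooks form a pack/unpack pair: the save hook returns a tag, and the retrieval hook maps that tag back to the data. The sketch below is a minimal illustration of that contract using plain Python objects instead of tensors. `TagStashSketch` and its integer tags are assumptions made for the example, not the manager's actual tag format.

```python
import itertools


class TagStashSketch:
    """Hypothetical tag-based stash mirroring the save/get hook pair."""

    def __init__(self) -> None:
        self._next_tag = itertools.count()  # monotonically increasing tags
        self._stash = {}                    # tag -> stashed value

    def on_save_for_backward(self, value):
        # Store the value and hand back a small tag in its place.
        tag = next(self._next_tag)
        self._stash[tag] = value
        return tag

    def on_get_saved_tensor(self, tag):
        # Look up the value and drop it from the stash once it is retrieved.
        return self._stash.pop(tag)
```

In this sketch each tag can be redeemed once, which keeps the stash from growing across iterations. Whether the real manager frees pages on retrieval is not stated here.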

class core.transformer.moe.paged_stash.PagedStashContext(stash_manager)#

Wrapper context manager that adds custom enter/exit behavior around saved_tensors_hooks.

Initialization

__enter__()#

__exit__(*args: Any)#
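Wrapping one context manager in another is a standard way to add behavior around enter and exit. The sketch below shows that shape with hypothetical classes; `RecordingInner` stands in for the inner hooks context, which in real use would be `torch.autograd.graph.saved_tensors_hooks(pack, unpack)`. The ordering of the custom steps relative to the inner context is an assumption.

```python
class RecordingInner:
    """Hypothetical stand-in for the inner hooks context manager."""

    def __init__(self, events) -> None:
        self.events = events

    def __enter__(self):
        self.events.append("inner enter")
        return self

    def __exit__(self, *args):
        self.events.append("inner exit")
        return False  # do not suppress exceptions


class WrappedContextSketch:
    """Adds custom steps before entering and after exiting the inner context."""

    def __init__(self, inner, events) -> None:
        self.inner = inner
        self.events = events

    def __enter__(self):
        self.events.append("custom enter")  # extra setup (assumed to run first)
        self.inner.__enter__()
        return self

    def __exit__(self, *args):
        result = self.inner.__exit__(*args)
        self.events.append("custom exit")   # extra teardown (assumed to run last)
        return result
```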

core.transformer.moe.paged_stash.paged_stash_group_start(tensor)#: Mark the start of a layer group and prepare for stash/reload.

core.transformer.moe.paged_stash.get_paged_stash_context( name=None, max_num_tokens=None, num_tokens_tensor=None, avg_num_tokens=None, )#: Get the paged stash context

core.transformer.moe.paged_stash.paged_stash_group_commit(tensor, name=None)#: Mark the end of a layer group and prepare for stash/reload.

core.transformer.moe.paged_stash.paged_stash_init_chunk_handler(vp_size, vp_stage)#: Initialize the chunk handler, called at the start of a microbatch forward pass.

core.transformer.moe.paged_stash.paged_stash_reset(enabled=True, config=None)#

Reset the chunk handler, called at the start of a training iteration.

config: optional TransformerConfig; if provided, moe_paged_stash_buffer_size_factor_cuda/cpu and moe_paged_stash_page_size are read from it. Otherwise the buffer size factors default to 1.10 (CUDA) and 0.0 (CPU).
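The fallback rule for the buffer size factors can be sketched as follows. `resolve_buffer_factors` is a hypothetical helper, and `SimpleNamespace` stands in for a TransformerConfig; only the two attribute names and the default values come from the description above.

```python
from types import SimpleNamespace


def resolve_buffer_factors(config=None):
    """Return (cuda_factor, cpu_factor), using the documented defaults without a config."""
    if config is None:
        return 1.10, 0.0
    return (
        config.moe_paged_stash_buffer_size_factor_cuda,
        config.moe_paged_stash_buffer_size_factor_cpu,
    )


# Usage with a stand-in config object (not a real TransformerConfig).
example_config = SimpleNamespace(
    moe_paged_stash_buffer_size_factor_cuda=1.25,
    moe_paged_stash_buffer_size_factor_cpu=0.5,
)
factors = resolve_buffer_factors(example_config)
```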

core.transformer.moe.paged_stash.check_paged_stash_overflow()#: Check whether the paged stash has overflowed.

core.transformer.moe.paged_stash.check_paged_stash_host_spill()#: True if any activation was stashed to pinned host (successful spill, not overflow path).

class core.transformer.moe.paged_stash.PagedStashRunner( config, copy_main_params, model, optimizer, forward_backward_func, )#

Runner for paged stash

Initialization

_set_moe_paged_stash_all(value: bool) → None#: Set moe_paged_stash on every tracked config (train + per VP chunk root).

data_read(data_iterator, model, training, num_microbatches)#: Read all microbatch inputs from Dataloader and copy to static buffers.

check_moe_overflow()#: Return (stash_overflow_rank_sum, overbudget_rank_sum, host_spill_rank_sum), computed with a single all_reduce.
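Packing several flags into one vector lets a single collective cover all of them. The sketch below simulates that idea in pure Python, with no distributed backend. `simulated_all_reduce_sum` and `check_flags_sketch` are hypothetical helpers, and the sum reduction is inferred from the `_rank_sum` names rather than confirmed.

```python
def simulated_all_reduce_sum(per_rank_vectors):
    """Element-wise sum across ranks, standing in for one all_reduce call."""
    return [sum(column) for column in zip(*per_rank_vectors)]


def check_flags_sketch(per_rank_flags):
    """per_rank_flags: one (stash_overflow, over_budget, host_spill) triple per rank."""
    # Each rank packs its three flags into one vector of 0/1 integers.
    packed = [[int(o), int(b), int(h)] for (o, b, h) in per_rank_flags]
    # One reduction yields all three per-flag rank counts at once.
    overflow_sum, overbudget_sum, host_spill_sum = simulated_all_reduce_sum(packed)
    return overflow_sum, overbudget_sum, host_spill_sum
```

Each returned value counts how many ranks raised that flag, so a nonzero entry means at least one rank hit the condition.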

prepare_for_rerun(is_training=True)#: Prepare for rerun

__call__(*args, **kwargs)#

Training-step wrapper with fallback when static-buffer paths overflow.

The first attempt runs forward/backward with a static HybridEP receive budget (moe_expert_rank_capacity_factor) and paged stashing enabled. If either the HybridEP permute buffer is over budget (tokens dropped) or the paged stash buffer overflows, this wrapper retries once in synced dropless mode: no static limit on HybridEP (capacity factor cleared, dynamic permuted size via CPU sync) and no paged stash (moe_paged_stash disabled).

At most two attempts. Each attempt prefetches microbatches, runs the schedule, then all-reduces stash overflow, HybridEP over-budget, and host spill across ranks. On success, restore capacity factor and moe_paged_stash for the next step. On overflow, prepare_for_rerun resets grads and the CUDA graph before retry.
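The two-attempt control flow described above can be sketched in pure Python. Everything here is hypothetical scaffolding: `run_step_with_fallback`, the `state` dictionary, and the callables are illustration names, and raising an error if the second attempt also overflows is an assumption the description does not cover.

```python
def run_step_with_fallback(run_attempt, prepare_for_rerun, state):
    """Run up to two attempts and return how many were needed.

    run_attempt(state) returns (stash_overflow, over_budget, host_spill) sums.
    state holds 'capacity_factor' and 'moe_paged_stash' (hypothetical keys).
    """
    saved = (state["capacity_factor"], state["moe_paged_stash"])
    try:
        for attempt in range(2):
            stash_overflow, over_budget, _host_spill = run_attempt(state)
            if not (stash_overflow or over_budget):
                return attempt + 1
            if attempt == 0:
                # Reset grads / CUDA graph, then switch to synced dropless mode.
                prepare_for_rerun()
                state["capacity_factor"] = None
                state["moe_paged_stash"] = False
        # Assumption: a second overflow is treated as an error in this sketch.
        raise RuntimeError("overflow persisted in dropless mode")
    finally:
        # Restore the settings so the next step starts with the static budget.
        state["capacity_factor"], state["moe_paged_stash"] = saved
```

Host spill alone does not trigger a retry in this sketch, following the description of it as a successful spill rather than an overflow.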
