core.transformer.moe.paged_stash#

Module Contents#

Classes#

PagedStashBuffer

A paged stash buffer with page-level memory management. Supports both CUDA and optional pinned host buffer for overflow fallback.

PagedTensor

A paged tensor that stores data in pages within a paged stash buffer.

PipelinePreScheduleFunction

This function is used to update the pp schedule.

PipelinePostScheduleFunction

This function is used to update the pp schedule.

PagedStashManager

Singleton manager for coordinating paged stashing across pipeline stages. Manages chunk handlers, synchronizes GPU-GPU transfers, and handles virtual pipeline parallelism

PagedStashContext

Wrapper context manager that adds custom enter/exit behavior around saved_tensors_hooks.

PagedStashRunner

Runner for paged stash

Functions#

paged_stash_group_start

Mark the start of a layer group and prepare for stash/reload.

get_paged_stash_context

Get the paged stash context

paged_stash_group_commit

Mark the end of a layer group and prepare for stash/reload.

paged_stash_init_chunk_handler

Initialize the chunk handler, called at the start of a microbatch forward pass.

paged_stash_reset

Reset the chunk handler, called at the start of a training iteration.

check_paged_stash_overflow

Check if paged stash overflow

check_paged_stash_host_spill

True if any activation was stashed to pinned host (successful spill, not overflow path).

Data#

API#

core.transformer.moe.paged_stash.logger#

‘getLogger(…)’

core.transformer.moe.paged_stash.SCALE_INV_BLOCK_SIZE#

32

class core.transformer.moe.paged_stash.PagedStashBuffer(
num_tokens,
hidden_size,
page_size,
device,
overflow,
host_spill,
dtype,
num_tokens_host=0,
)#

A paged stash buffer with page-level memory management. Supports both CUDA and optional pinned host buffer for overflow fallback.

Buffers are organized as [num_pages, page_size, hidden_size]. Uses per-buffer free lists (circular buffer) tracked as two-element state: [0]=CUDA, [1]=host.

Initialization

Parameters:
  • num_tokens – Maximum number of tokens the CUDA buffer can hold

  • hidden_size – Hidden dimension size

  • page_size – Number of tokens per page

  • device – Device for the buffer

  • overflow – Overflow flag tensor (shared across all buffers)

  • host_spill – Global flag set to 1 if any stash used pinned host (shared)

  • dtype – Data type

  • num_tokens_host – If > 0, allocate pinned host buffer with this many tokens for spillover.

reset()#

Reset both CUDA and host free lists (CUDA graph safe: no new allocations).

__repr__()#
class core.transformer.moe.paged_stash.PagedTensor(
tensor,
num_tokens_tensor=None,
avg_num_tokens: int = None,
vp_stage=None,
original_shape=None,
schedule_layer_no=None,
is_columnwise_scale_inv=None,
max_num_tokens=None,
hidden_size=None,
page_size=64,
)#

A paged tensor that stores data in pages within a paged stash buffer.

Initialization

Parameters:
  • tensor – The tensor to store

  • num_tokens_tensor – Scalar tensor containing actual number of tokens

  • vp_stage – Virtual pipeline stage

  • layer_name – Name of the layer

  • max_num_tokens – Maximum number of tokens

  • hidden_size – Hidden size

  • page_size – Number of tokens per page

property schedule_layer#

Get the schedule layer.

offload_to_stash(
paged_stash_buffer: core.transformer.moe.paged_stash.PagedStashBuffer,
max_blocks=2048,
)#

Offload the paged tensor to paged stash buffer (CUDA or host if CUDA full).

reload_from_stash(
paged_stash_buffer: core.transformer.moe.paged_stash.PagedStashBuffer,
max_blocks=2048,
)#

Reload the paged tensor from paged stash buffer (CUDA or host from spilled_to_host).

_tensor must already be allocated on the main (default) stream by the caller; this method only enqueues unpack-stream kernels that fill it from the stash.

class core.transformer.moe.paged_stash.PipelinePreScheduleFunction#

Bases: torch.autograd.Function

This function is used to update the pp schedule.

static forward(ctx, tensor, stash_manager)#
static backward(ctx, *grad_output)#
class core.transformer.moe.paged_stash.PipelinePostScheduleFunction#

Bases: torch.autograd.Function

This function is used to update the pp schedule.

static forward(ctx, tensor, stash_manager)#
static backward(ctx, *grad_output)#
class core.transformer.moe.paged_stash.PagedStashManager#

Singleton manager for coordinating paged stashing across pipeline stages. Manages chunk handlers, synchronizes GPU-GPU transfers, and handles virtual pipeline parallelism

Initialization

Initialize the manager with queues and dedicated CUDA streams.

STASH_MGR#

None

classmethod get_instance()#

Get the singleton instance of PagedStashManager.

property pack_stream#

Get the pack stream.

property unpack_stream#

Get the unpack stream.

set_current_layer_name(name)#

Set the current layer name.

get_schedule_layer(vp_stage, layer_no, microbatch_no)#

Get the schedule layer.

add_paged_tensor_to_stash(paged_tensor)#

Add a paged tensor to the stash list.

remove_paged_tensor_from_stash()#

Remove all paged tensors from the stash list.

stash_paged_tensors(pp_schedule_layer)#

Stash the paged tensors.

wait_for_stash_to_complete()#

Wait for stash to complete.

reload_paged_tensors(pp_schedule_layer, no_wait=False)#

Reload the paged tensors.

allocate_stash_buffers(
moe_paged_stash_buffer_size_factor_cuda: float = 1.1,
moe_paged_stash_buffer_size_factor_cpu: float = 0.0,
)#

Allocate stash buffers organized by [dtype][hidden_size].

release_stash_buffers()#

Drop large stash CUDA/host page buffers after full-iteration CUDA graph teardown (fallback).

Shared overflow / host_spill scalars are retained (small). Reallocation of page buffers happens on the next paged_stash_reset while status remains captured.

update_pp_schedule(vp_stage, layer_no=None, microbatch_no=None)#

Update the pp schedule.

update_model_chunk(vp_stage_index)#

Update layer=1, increment microbatch of new vp vp_stage.

on_save_for_backward(tensor: torch.Tensor) Any#

Hook called when autograd saves a tensor for backward pass. Returns a tag to identify the tensor later.

on_get_saved_tensor(saved_state: Any) torch.Tensor#

Hook called when autograd retrieves a saved tensor during backward pass. Returns the actual tensor (potentially reloading from CPU).

class core.transformer.moe.paged_stash.PagedStashContext(stash_manager)#

Wrapper context manager that adds custom enter/exit behavior around saved_tensors_hooks.

Initialization

__enter__()#
__exit__(*args: Any)#
core.transformer.moe.paged_stash.paged_stash_group_start(tensor)#

Mark the start of a layer group and prepare for stash/reload.

core.transformer.moe.paged_stash.get_paged_stash_context(
name=None,
max_num_tokens=None,
num_tokens_tensor=None,
avg_num_tokens=None,
)#

Get the paged stash context

core.transformer.moe.paged_stash.paged_stash_group_commit(tensor, name=None)#

Mark the end of a layer group and prepare for stash/reload.

core.transformer.moe.paged_stash.paged_stash_init_chunk_handler(vp_size, vp_stage)#

Initialize the chunk handler, called at the start of a microbatch forward pass.

core.transformer.moe.paged_stash.paged_stash_reset(enabled=True, config=None)#

Reset the chunk handler, called at the start of a training iteration.

config: optional TransformerConfig; if provided, moe_paged_stash_buffer_size_factor_cuda/cpu and moe_paged_stash_page_size are read from it. Otherwise defaults to 1.10 (CUDA), 0.0 (CPU).

core.transformer.moe.paged_stash.check_paged_stash_overflow()#

Check if paged stash overflow

core.transformer.moe.paged_stash.check_paged_stash_host_spill()#

True if any activation was stashed to pinned host (successful spill, not overflow path).

class core.transformer.moe.paged_stash.PagedStashRunner(
config,
copy_main_params,
model,
optimizer,
forward_backward_func,
)#

Runner for paged stash

Initialization

_set_moe_paged_stash_all(value: bool) None#

Set moe_paged_stash on every tracked config (train + per VP chunk root).

data_read(data_iterator, model, training, num_microbatches)#

Read all microbatch inputs from Dataloader and copy to static buffers.

check_moe_overflow()#

(stash_overflow_rank_sum, overbudget_rank_sum, host_spill_rank_sum); one all_reduce.

prepare_for_rerun(is_training=True)#

Prepare for rerun

__call__(*args, **kwargs)#

Training-step wrapper with fallback when static-buffer paths overflow.

The first attempt runs forward/backward with a static HybridEP receive budget (moe_expert_rank_capacity_factor) and paged stashing enabled. If either the HybridEP permute buffer is over budget (tokens dropped) or the paged stash buffer overflows, this wrapper retries once in synced dropless mode: no static limit on HybridEP (capacity factor cleared, dynamic permuted size via CPU sync) and no paged stash (moe_paged_stash disabled).

At most two attempts. Each attempt prefetches microbatches, runs the schedule, then all-reduces stash overflow, HybridEP over-budget, and host spill across ranks. On success, restore capacity factor and moe_paged_stash for the next step. On overflow, prepare_for_rerun resets grads and the CUDA graph before retry.