`bridge.peft.utils`#

Module Contents#

Classes#

`AdapterAttributes`	Container for base linear adapter attributes.
`_All2AllHp2Sp`	All-2-All from Hidden Parallel to Sequence Parallel.
`ParallelLinearAdapter`	Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.
`_GroupedExpertAdapterWeight`	Callable parameter container so DDP forward pre-hooks see grouped LoRA weights.
`GroupedExpertLinearAdapter`	LoRA adapter with one low-rank pair per local grouped MoE expert.
`PackedPerExpertLinear`	Per-expert linear with a packed 3D weight `[N_local, out, in]`.
`SharedOuterGroupedExpertAdapter`	LoRA adapter for grouped expert MLP with shared-outer semantics.

Functions#

`_te_grouped_linear_uses_explicit_m_splits`	Return whether TE’s grouped-linear autograd takes a separate splits tensor.
`_get_pg_collection_from_module`	Return the process-group collection attached to a module or its config.
`_get_pg_collection`	Return the explicit PG collection or MCore’s default collection fallback.
`_iter_sharded_tensor_factories`	Return all sharded tensor factories in a nested state dict.
`_checkpoint_tensor_shape`	Return checkpoint global tensor shape for a key, tolerating model-section prefixes.
`_legacy_shared_expert_adapter_key`	Return the adapter module key if a factory represents a shared expert LoRA tensor.
`_legacy_shared_expert_adapter_matches`	Return adapter modules matching a legacy shared-expert checkpoint key.
`enable_legacy_shared_expert_adapter_loading`	Enable legacy 2D checkpoint loading for old shared grouped-expert adapters.
`_get_process_group`	Return the first named process group available on a collection.
`_process_group_size`	Return a process-group size without consulting global parallel state.
`_process_group_rank`	Return this rank within a process group without consulting global parallel state.
`_get_tensor_parallel_group`	Return the tensor-parallel group for dense or expert linear layers.
`_get_tensor_parallel_group_from_module`	Return the TP group passed to the wrapped module, falling back to its collection.
`create_peft`	Create a Bridge PEFT object from a small config mapping.
`create_peft_hook`	Create a provider pre-wrap hook that applies PEFT.
`load_peft_adapter_checkpoint`	Load a PEFT adapter checkpoint into an already transformed model.
`_apply_peft`	Apply PEFT and mark adapter parameters for checkpointing.
`_import_peft_class`
`_model_state_dict`	Generate Bridge model checkpoint sections for an external trainer.
`_ensure_model_list`
`is_modelopt_linear`	Return whether a module is ModelOpt’s local Megatron Linear.
`get_adapter_attributes_from_linear`	Returns attributes from the base layer as an AdapterAttributes dataclass.
`is_expert_linear`	Return whether the current base module is an expert linear module.
`is_grouped_expert_linear`	Return whether the current base module is a grouped expert linear module.
`get_effective_lora_dim`	Return the LoRA rank to use, reduced for expert layers when `normalize_moe_lora` is enabled.
`align_expert_dim_for_tp`	Round normalized expert LoRA ranks up to the expert-TP granularity when needed.
`wildcard_match`	Return whether the pattern (target module to add LoRA) matches the key (model weight name).
`init_method_normal`	Create an initialization method based on normal distribution N(0, sigma).
`init_method_kaiming_uniform`	Create an initialization method based on Kaiming uniform distribution.
`init_method_const`	Create an initialization method that sets all values to a constant.
`pad_seq_to_mult`	Pad sequence length to be a multiple of mult.
`unpad_seq_to_mult`	Remove sequence padding that was added by pad_seq_to_mult.
`all2all_hp2sp`	Perform All-to-All communication from Hidden Parallel to Sequence Parallel.
`_divide_exact`	Divide `value` by `divisor` and raise when the result would be fractional.
`_apply_grouped_expert_swiglu_sharded_factory`	Split grouped-expert SwiGLU tensors along the fused hidden axis for checkpointing.
`_append_rank_offset`	Append a sharding offset, combining fragmentations when the axis is already sharded.
`_make_grouped_expert_sharded_tensor`	Build a sharded tensor for packed grouped-expert weights.
`_make_cross_ep_replicated`	Mark a weight as logically replicated across the intra-PP-stage group.

Data#

`logger`
`ModelList`
`ModelHook`
`CheckpointPath`
`_LEGACY_SHARED_EXPERT_ADAPTER_CHECKPOINT_ATTR`
`HAVE_TE`
`TECL`
`TERL`

API#

bridge.peft.utils.logger#: ‘getLogger(…)’

bridge.peft.utils.ModelList#: None

bridge.peft.utils.ModelHook#: None

bridge.peft.utils.CheckpointPath#: None

bridge.peft.utils._LEGACY_SHARED_EXPERT_ADAPTER_CHECKPOINT_ATTR#: ‘use_legacy_shared_expert_adapter_checkpoint’

bridge.peft.utils.HAVE_TE#: ‘all(…)’

bridge.peft.utils._te_grouped_linear_uses_explicit_m_splits( autograd_function: type[torch.autograd.Function], ) → bool#: Return whether TE’s grouped-linear autograd takes a separate splits tensor.

bridge.peft.utils._get_pg_collection_from_module( module: object | None, ) → megatron.core.process_groups_config.ProcessGroupCollection | None#: Return the process-group collection attached to a module or its config.

bridge.peft.utils._get_pg_collection( pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, source: object | None = None, *, required_pgs: List[str], ) → megatron.core.process_groups_config.ProcessGroupCollection | None#: Return the explicit PG collection or MCore’s default collection fallback.

bridge.peft.utils._iter_sharded_tensor_factories( state_dict: object, ) → list[megatron.core.dist_checkpointing.mapping.ShardedTensorFactory]#: Return all sharded tensor factories in a nested state dict.

bridge.peft.utils._checkpoint_tensor_shape( checkpoint_metadata: Mapping[str, megatron.core.dist_checkpointing.mapping.ShardedTensor], key: str, ) → tuple[int, ...] | None#: Return checkpoint global tensor shape for a key, tolerating model-section prefixes.

bridge.peft.utils._legacy_shared_expert_adapter_key( factory: megatron.core.dist_checkpointing.mapping.ShardedTensorFactory, ) → str | None#: Return the adapter module key if a factory represents a shared expert LoRA tensor.

bridge.peft.utils._legacy_shared_expert_adapter_matches( adapters_by_name: Mapping[str, ParallelLinearAdapter], adapter_key: str, ) → list[ParallelLinearAdapter]#: Return adapter modules matching a legacy shared-expert checkpoint key.

bridge.peft.utils.enable_legacy_shared_expert_adapter_loading( megatron_model: list[torch.nn.Module] | torch.nn.Module, sharded_state_dict: megatron.core.dist_checkpointing.mapping.ShardedStateDict, checkpoint_path: str | pathlib.Path, ) → bool#

Enable legacy 2D checkpoint loading for old shared grouped-expert adapters.

New shared grouped-expert LoRA checkpoints expose a leading global expert axis so they can be resharded across EP changes. Older checkpoints saved the same shared adapter as a plain 2D tensor. This helper detects that old metadata shape and marks only the matching shared adapter modules to emit the legacy 2D sharded state dict for loading.

Parameters:

megatron_model – Model module or model chunks containing PEFT adapters.
sharded_state_dict – Current adapter-only sharded state dict.
checkpoint_path – Distributed checkpoint directory to inspect.

Returns:

True if at least one shared expert adapter was marked for legacy loading.
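
The detection step described above amounts to a shape check on checkpoint metadata. The following is a minimal sketch, not the Bridge implementation: the helper names, the example keys, and the example shapes are all hypothetical. It assumes only what the description states, namely that new checkpoints carry a leading expert axis (3D) while legacy checkpoints store a plain 2D tensor.

```python
from typing import List, Mapping, Tuple


def is_legacy_shared_adapter_shape(shape: Tuple[int, ...]) -> bool:
    """Return True for a plain 2D shape, i.e. one without a leading expert axis."""
    return len(shape) == 2


def find_legacy_keys(checkpoint_shapes: Mapping[str, Tuple[int, ...]]) -> List[str]:
    """Collect the keys whose stored global shape looks like the legacy 2D layout."""
    return sorted(
        key for key, shape in checkpoint_shapes.items()
        if is_legacy_shared_adapter_shape(shape)
    )


# Hypothetical metadata: key names and sizes are illustrative only.
example_shapes = {
    "layers.0.adapter.linear_in.weight": (16, 1024),     # legacy: [rank, hidden]
    "layers.1.adapter.linear_in.weight": (8, 16, 1024),  # new: [experts, rank, hidden]
}
legacy_keys = find_legacy_keys(example_shapes)
```

Only the keys in `legacy_keys` would then be candidates for the legacy loading path; everything else keeps the reshardable layout.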

bridge.peft.utils._get_process_group( pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None, *names: str, ) → object | None#: Return the first named process group available on a collection.
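
A "first named group available" lookup can be sketched as below. This is an illustrative stand-in rather than the library code: `first_available_group` is a hypothetical name, the collection is modeled with a `SimpleNamespace`, and the attribute names `tp` and `expt_tp` are used only as examples.

```python
from types import SimpleNamespace
from typing import Optional


def first_available_group(collection: Optional[object], *names: str) -> Optional[object]:
    """Return the first attribute in `names` that exists and is not None."""
    if collection is None:
        return None
    for name in names:
        group = getattr(collection, name, None)
        if group is not None:
            return group
    return None


# Stand-in collection: strings play the role of process-group handles.
collection = SimpleNamespace(tp=None, expt_tp="expert-tp-group")
chosen = first_available_group(collection, "tp", "expt_tp")
```

Passing the names in priority order lets a caller express "prefer this group, otherwise fall back to that one" without touching global parallel state.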

bridge.peft.utils._process_group_size(group: object | None, fallback: int = 1) → int#: Return a process-group size without consulting global parallel state.

bridge.peft.utils._process_group_rank(group: object | None, fallback: int = 0) → int#: Return this rank within a process group without consulting global parallel state.

bridge.peft.utils._get_tensor_parallel_group( pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None, *, is_expert: bool = False, ) → object | None#: Return the tensor-parallel group for dense or expert linear layers.

bridge.peft.utils._get_tensor_parallel_group_from_module( module: torch.nn.Module, *, is_expert: bool = False, pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, ) → object | None#: Return the TP group passed to the wrapped module, falling back to its collection.

bridge.peft.utils.TECL#: ()

bridge.peft.utils.TERL#: ()

bridge.peft.utils.create_peft( config: Mapping[str, Any], dtype: torch.dtype | str | int | None = None, ) → object | None#: Create a Bridge PEFT object from a small config mapping.

bridge.peft.utils.create_peft_hook( peft: object, training: bool = True, ) → bridge.peft.utils.ModelHook#: Create a provider pre-wrap hook that applies PEFT.

bridge.peft.utils.load_peft_adapter_checkpoint( model: bridge.peft.utils.ModelList | megatron.core.transformer.module.MegatronModule, adapter_checkpoint_path: bridge.peft.utils.CheckpointPath, peft: object, strict: bool = False, model_sd_kwargs: Mapping[str, object] | None = None, ckpt_format: str = 'torch_dist', pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, fully_parallel_load: bool = True, load_strategy: object | None = None, ) → None#: Load a PEFT adapter checkpoint into an already transformed model.

bridge.peft.utils._apply_peft( peft: object, model: bridge.peft.utils.ModelList, training: bool = True, ) → bridge.peft.utils.ModelList#: Apply PEFT and mark adapter parameters for checkpointing.

bridge.peft.utils._import_peft_class(peft_type: str) → type[Any]#

bridge.peft.utils._model_state_dict( model: bridge.peft.utils.ModelList, model_sd_kwargs: Mapping[str, object] | None = None, ckpt_format: str = 'torch_dist', pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, ) → dict[str, Any]#: Generate Bridge model checkpoint sections for an external trainer.

bridge.peft.utils._ensure_model_list( model: bridge.peft.utils.ModelList | megatron.core.transformer.module.MegatronModule, ) → bridge.peft.utils.ModelList#

bridge.peft.utils.is_modelopt_linear(m: torch.nn.Module) → bool#: Return whether a module is ModelOpt’s local Megatron Linear.

class bridge.peft.utils.AdapterAttributes#

Container for base linear adapter attributes.

input_is_parallel: bool#: None

in_features: int#: None

out_features: int#: None

disable_tensor_parallel_comm: bool#: None

disable_sequence_parallel_comm: bool#: None

base_linear_is_parallel: bool#: None

bridge.peft.utils.get_adapter_attributes_from_linear( m: torch.nn.Module, is_expert: bool = False, pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, sequence_parallel_input_regather: bool = False, ) → bridge.peft.utils.AdapterAttributes#

Returns attributes from the base layer as an AdapterAttributes dataclass.

The dataclass fields are input_is_parallel, in_features, out_features, disable_tensor_parallel_comm, disable_sequence_parallel_comm, and base_linear_is_parallel.

This function analyzes a linear module and extracts key attributes needed for adapter configuration, particularly for PEFT adapters in distributed training scenarios.

Parameters:

m – The linear module to analyze (should have a config attribute).
is_expert – Whether the linear belongs to an expert module.
pg_collection – Optional process-group collection associated with the module.
sequence_parallel_input_regather – Whether LoRA-A should re-gather its sequence-parallel input in backward.

Returns:

An AdapterAttributes instance containing:

input_is_parallel – Whether the input is already parallelized.
in_features – Input feature dimension.
out_features – Output feature dimension.
disable_tensor_parallel_comm – Whether to disable tensor parallel communication.
disable_sequence_parallel_comm – Whether to disable sequence parallel communication.
base_linear_is_parallel – Whether the base linear layer uses parallelization.

Return type:

AdapterAttributes

Raises:

NotImplementedError – If the layer type is not recognized for LoRA adaptation.

bridge.peft.utils.is_expert_linear(fqn: str) → bool#

Return whether the current base module is an expert linear module.

This function checks if a fully qualified name (FQN) corresponds to an expert linear module in a Mixture of Experts (MoE) architecture.

Parameters:: fqn – Fully qualified name of the module.
Returns:: True if the module is an expert linear module, False otherwise.

Example:

>>> is_expert_linear("model.layers.0.mlp.experts.0.linear_fc1")
True
>>> is_expert_linear("model.layers.0.mlp.linear_fc1")
False
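
One way to express such an FQN check is a regular expression over the module path. The sketch below is an assumption for illustration: `looks_like_expert_linear` and its pattern are hypothetical, and the actual rule may cover more module layouts than the two documented examples.

```python
import re

# Hypothetical pattern: an "experts" container under "mlp", an optional
# numeric expert index, then linear_fc1 or linear_fc2 at the end of the name.
_EXPERT_LINEAR_PATTERN = re.compile(r"\.mlp\.experts\.(?:\d+\.)?linear_fc[12]$")


def looks_like_expert_linear(fqn: str) -> bool:
    """Return True when the fully qualified name ends in an expert linear layer."""
    return _EXPERT_LINEAR_PATTERN.search(fqn) is not None


expert_hit = looks_like_expert_linear("model.layers.0.mlp.experts.0.linear_fc1")
dense_hit = looks_like_expert_linear("model.layers.0.mlp.linear_fc1")
```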

bridge.peft.utils.is_grouped_expert_linear(fqn: str) → bool#: Return whether the current base module is a grouped expert linear module.

bridge.peft.utils.get_effective_lora_dim( module: torch.nn.Module, *, dim: int, normalize_moe_lora: bool, is_expert: bool, ) → int#: Return the LoRA rank to use, reduced for expert layers when normalize_moe_lora is enabled.

bridge.peft.utils.align_expert_dim_for_tp( module: torch.nn.Module, dim: int, *, normalize_moe_lora: bool, is_expert: bool, input_is_parallel: bool, pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, ) → int#: Round normalized expert LoRA ranks up to the expert-TP granularity when needed.
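
The rounding itself is ordinary "round up to a multiple" arithmetic. The sketch below shows only that step; `round_up_to_multiple` is a hypothetical helper, and it assumes the granularity is a positive integer such as the expert-TP group size. The conditions under which the real function applies the rounding are not reproduced here.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= `value`."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    # Ceiling division written with floor division on the negated value.
    return -(-value // multiple) * multiple


aligned_rank = round_up_to_multiple(6, 4)   # a rank of 6 with granularity 4
unchanged_rank = round_up_to_multiple(8, 4)  # already aligned
```

Rounding up rather than down keeps the adapter from losing capacity when the normalized rank is not evenly divisible.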

bridge.peft.utils.wildcard_match( pattern: str, key: Optional[str], ) → Optional[bool]#

Return whether the pattern (target module to add LoRA) matches the key (model weight name).

This function performs wildcard matching using ‘*’ as a placeholder for any substring.

Parameters:

pattern – Pattern string with wildcards (*) to match against.
key – Key string to test against the pattern.

Returns:

True if the pattern matches the key, False if it doesn’t, None if key is None.

Example:

>>> wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.0.self_attention.linear_qkv")
True
>>> wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.1.self_attention.linear_qkv")
False
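
Wildcard matching of this kind can be built on the standard `re` module by turning each `*` into `.*`. The following is a sketch under that assumption; `wildcard_match_sketch` is a hypothetical name, and escaping the literal parts of the pattern is a choice made here, not something the documentation specifies.

```python
import re
from typing import Optional


def wildcard_match_sketch(pattern: str, key: Optional[str]) -> Optional[bool]:
    """Match `key` against `pattern`, where '*' stands for any substring."""
    if key is None:
        return None
    # Escape literal segments so '.' only matches a dot, then join with '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, key) is not None


same_layer = wildcard_match_sketch(
    "*.layers.0.*.linear_qkv", "decoder.layers.0.self_attention.linear_qkv"
)
other_layer = wildcard_match_sketch(
    "*.layers.0.*.linear_qkv", "decoder.layers.1.self_attention.linear_qkv"
)
```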

bridge.peft.utils.init_method_normal( sigma: float, ) → Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method based on a normal distribution with mean 0 and standard deviation sigma.

Parameters:: sigma – Standard deviation for the normal distribution.
Returns:: Initialization function that applies normal distribution to a tensor.

bridge.peft.utils.init_method_kaiming_uniform( val: float, ) → Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method based on Kaiming uniform distribution.

Parameters:: val – The ‘a’ parameter for Kaiming uniform initialization.
Returns:: Initialization function that applies Kaiming uniform distribution to a tensor.

bridge.peft.utils.init_method_const( val: float, ) → Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method that sets all values to a constant.

Parameters:: val – Constant value to initialize the tensor with.
Returns:: Initialization function that sets tensor to constant value.

bridge.peft.utils.pad_seq_to_mult( x: torch.Tensor, mult: int, ) → Tuple[torch.Tensor, int]#

Pad sequence length to be a multiple of mult.

This function pads the first dimension of the tensor to ensure it’s divisible by mult. Used primarily for MoE (Mixture of Experts) operations that require specific sequence lengths.

Parameters:

x – Input tensor to pad.
mult – Multiple that the sequence length should be divisible by.

Returns:

A tuple containing the padded tensor and the number of padding elements added.

Return type:

Tuple[torch.Tensor, int]

bridge.peft.utils.unpad_seq_to_mult(x: torch.Tensor, pad_len: int) → torch.Tensor#

Remove sequence padding that was added by pad_seq_to_mult.

Parameters:

x – Padded tensor to unpad.
pad_len – Number of padding elements to remove from the end.

Returns:

Unpadded tensor with pad_len elements removed from the first dimension.
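
The pad and unpad helpers form a round trip along the first dimension. The sketch below illustrates the length arithmetic with plain Python lists standing in for tensors; `pad_rows_to_mult` and `unpad_rows` are hypothetical names, and padding with zeros is an assumption.

```python
from typing import List, Tuple

Rows = List[List[float]]


def pad_rows_to_mult(rows: Rows, mult: int, pad_value: float = 0.0) -> Tuple[Rows, int]:
    """Append rows until the row count is a multiple of `mult`."""
    pad_len = (-len(rows)) % mult  # 0 when the length is already a multiple
    width = len(rows[0]) if rows else 0
    padded = rows + [[pad_value] * width for _ in range(pad_len)]
    return padded, pad_len


def unpad_rows(rows: Rows, pad_len: int) -> Rows:
    """Drop the last `pad_len` rows (a no-op when pad_len is 0)."""
    return rows[: len(rows) - pad_len]


tokens = [[float(i), float(i) + 0.5] for i in range(5)]  # 5 rows, 2 features
padded, pad_len = pad_rows_to_mult(tokens, mult=4)
restored = unpad_rows(padded, pad_len)
```

Returning the pad length alongside the padded data is what lets the caller undo the padding exactly, without having to remember the original length separately.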

class bridge.peft.utils._All2AllHp2Sp#

Bases: torch.autograd.Function

All-2-All from Hidden Parallel to Sequence Parallel.

This is a temporary workaround for distributed communication patterns and can be updated in the future. It performs all-to-all communication to transform from hidden parallel to sequence parallel layout.

TODO: Move the functionality to MCore

static forward( ctx, input_: torch.Tensor, group: object | None, ) → torch.Tensor#

Forward pass: All-to-All from Hidden Parallel to Sequence Parallel.

Parameters:

ctx – Autograd context (unused but required by Function interface).
input_ – Input tensor in hidden parallel layout.

Returns:

Output tensor in sequence parallel layout.

static backward(ctx, grad_output: torch.Tensor) → torch.Tensor#

Backward pass: All-to-All from Sequence Parallel to Hidden Parallel.

Parameters:

ctx – Autograd context (unused but required by Function interface).
grad_output – Gradient tensor in sequence parallel layout.

Returns:

Gradient tensor in hidden parallel layout.

bridge.peft.utils.all2all_hp2sp( input_: torch.Tensor, tensor_parallel_group: object | None = None, ) → torch.Tensor#

Perform All-to-All communication from Hidden Parallel to Sequence Parallel.

Parameters:: input_ – Input tensor in hidden parallel layout.
Returns:: Output tensor in sequence parallel layout.

class bridge.peft.utils.ParallelLinearAdapter( in_features: int, out_features: int, dim: int, base_linear_name: str, activation: str = 'swish', column_init_method: str = 'xavier', row_init_method: str = 'zero', input_is_parallel: bool = False, dropout: float = 0.0, model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None, alpha: Optional[float] = None, dropout_position: str = 'pre', a2a_experimental: bool = False, is_expert: bool = False, disable_tensor_parallel_comm: bool = False, disable_sequence_parallel_comm: bool = True, base_linear_is_parallel: bool = True, pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, sequence_parallel_input_regather: bool = False, )#

Bases: torch.nn.Module

Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.

This adapter implements a low-rank adaptation pattern using two linear layers with configurable parallelization strategies. It supports both tensor and sequence parallelism patterns used in large language model training.

The adapter follows the pattern input -> linear_in -> activation -> linear_out -> scaling, where linear_in and linear_out are parallelized according to the base layer configuration.

Parameters:

in_features – Input feature dimension.
out_features – Output feature dimension.
dim – Adapter bottleneck dimension (rank).
base_linear_name – Name of the base linear layer being adapted.
activation – Activation function name (default: ‘swish’).
column_init_method – Initialization method for column parallel layer (default: ‘xavier’).
row_init_method – Initialization method for row parallel layer (default: ‘zero’).
input_is_parallel – Whether input is already parallelized (default: False).
dropout – Dropout probability (default: 0.0).
model_parallel_config – Configuration for model parallelism (default: None).
alpha – Scaling factor for adapter output (default: None, uses dim).
dropout_position – Where to apply dropout (‘pre’ or ‘post’, default: ‘pre’).
a2a_experimental – Whether to use experimental all-to-all communication (default: False).
is_expert – Whether this adapter is for expert layers in MoE (default: False).
disable_sequence_parallel_comm – Whether to disable sequence parallel communication (default: True).
base_linear_is_parallel – Whether the base linear layer uses parallelization (default: True).
sequence_parallel_input_regather – Whether eligible LoRA-A projections retain the sequence-local input and re-gather it in backward using MCore’s sequence-parallel linear path (default: False).
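
The computation pattern above can be illustrated on a single device with plain Python. This is a sketch only: it ignores parallelism and dropout, `adapter_forward` and its helpers are hypothetical, and the scaling factor alpha / dim is an assumption based on the usual LoRA convention (with alpha falling back to dim, as the parameter description states).

```python
import math
from typing import Callable, List, Optional

Matrix = List[List[float]]


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    """Multiply an [out, in] matrix by a length-in vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swish(v: float) -> float:
    """Swish (SiLU) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def adapter_forward(
    x: List[float],
    w_in: Matrix,   # [dim, in_features]
    w_out: Matrix,  # [out_features, dim]
    dim: int,
    alpha: Optional[float] = None,
    activation: Callable[[float], float] = swish,
) -> List[float]:
    """input -> linear_in -> activation -> linear_out -> scaling."""
    alpha = float(dim) if alpha is None else alpha
    hidden = [activation(v) for v in matvec(w_in, x)]
    out = matvec(w_out, hidden)
    return [v * (alpha / dim) for v in out]


x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0]]
# With an all-zero output projection the adapter contributes nothing at first.
zero_start = adapter_forward(x, w_in, [[0.0, 0.0]], dim=2)
# Identity activation makes the arithmetic easy to follow: (1 + 2) * (4 / 2).
scaled = adapter_forward(x, w_in, [[1.0, 1.0]], dim=2, alpha=4.0, activation=lambda v: v)
```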

Initialization

Initialize the ParallelLinearAdapter.

Parameters:

in_features – Input feature dimension.
out_features – Output feature dimension.
dim – Adapter bottleneck dimension.
base_linear_name – Name of the base linear layer.
activation – Activation function name.
column_init_method – Initialization for column parallel layers.
row_init_method – Initialization for row parallel layers.
input_is_parallel – Whether input is already parallelized.
dropout – Dropout probability.
model_parallel_config – Model parallelism configuration.
alpha – Scaling factor (uses dim if None).
dropout_position – When to apply dropout.
a2a_experimental – Use experimental all-to-all communication.
is_expert – Whether for expert layers in MoE.
disable_tensor_parallel_comm – Disable tensor parallel communication.
disable_sequence_parallel_comm – Disable sequence parallel communication.
sequence_parallel_input_regather – Re-gather eligible LoRA-A sequence-parallel inputs in backward.

_sequence_parallel_input_regather_eligibility( x: torch.Tensor, ) → tuple[bool, str | None]#: Return whether the targeted sequence-parallel linear path is safe.

_log_sequence_parallel_input_regather_fallback( reason: str | None, ) → None#: Log the first static fallback reason at debug level.

_get_activation_fn(activation: str) → torch.nn.Module#

Get activation function by name.

Parameters:: activation – Name of the activation function.
Returns:: PyTorch activation module.

.. note:: Defaults to Identity if activation name is not recognized.

_get_init_fn( init_method: str, ) → Callable[[torch.Tensor], torch.Tensor]#

Get initialization function by method name.

Parameters:: init_method – Name of the initialization method.
Returns:: Initialization function.
Raises:: NotImplementedError – If init_method is not supported.

forward(x: torch.Tensor, *args, **kwargs) → torch.Tensor#

Forward pass of the parallel linear adapter.

Performs the adaptation computation with proper handling of parallel communication patterns, dropout, and expert routing for MoE scenarios.

Parameters:: x – Input tensor.
Returns:: Adapted output tensor with scaling applied.

local_experts_per_rank() → int#: Return the number of global expert slots owned by this EP rank.

_uses_grouped_expert_sharding() → bool#: Return whether this shared adapter needs an explicit expert axis.

_allreduce_shared_expert_grad(grad: torch.Tensor) → torch.Tensor#: Sum shared expert adapter grads across EP before expert-DP reduction.

_register_shared_expert_grad_sync_hooks() → None#: Keep shared grouped-expert adapters synchronized across EP ranks.

_expert_axis_info( sharded_offsets: Tuple, ) → Tuple[int, int, int]#: Return the global expert-axis sharding metadata for this rank.

_keep_expert_extra_state() → bool#: Keep one unsharded adapter extra-state entry.

_set_expert_replica_ids( *state_dicts: megatron.core.dist_checkpointing.mapping.ShardedStateDict, ) → None#: Mark expert adapter replicas across expert data-parallel ranks.

_apply_expert_axis_factory( sharded_tensor: megatron.core.dist_checkpointing.mapping.ShardedTensor, sharded_offsets: Tuple, *, split_swiglu: bool = False, ) → megatron.core.dist_checkpointing.mapping.ShardedTensorFactory#: Map one shared 2D adapter tensor to this rank’s global expert slots.

sharded_state_dict( prefix: str = '', sharded_offsets: Tuple = (), metadata: Optional[Dict] = None, mamba_dim_info: Optional[Dict] = None, ) → megatron.core.dist_checkpointing.mapping.ShardedStateDict#

Create sharded state dictionary for distributed checkpointing.

Special treatment is given to the linear_fc1 adapter since tensor parallelism is sharded separately for the two logical matrices (gate and up) in SwiGLU.

Parameters:

prefix – Prefix for parameter names.
sharded_offsets – Offsets for sharded parameters.
metadata – Additional metadata for sharding.

Returns:

Sharded state dictionary for distributed checkpointing.

bridge.peft.utils._divide_exact(value: int, divisor: int, name: str) → int#: Divide value by divisor and raise when the result would be fractional.
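
An exact-division guard of this kind is a few lines around `divmod`. The sketch below is illustrative: `divide_exact` is a hypothetical stand-in, and raising `ValueError` is an assumption, since the exception type is not documented here.

```python
def divide_exact(value: int, divisor: int, name: str) -> int:
    """Return value // divisor, refusing to silently drop a remainder."""
    if divisor <= 0:
        raise ValueError(f"divisor for {name} must be positive, got {divisor}")
    quotient, remainder = divmod(value, divisor)
    if remainder != 0:
        raise ValueError(f"{name}={value} is not divisible by {divisor}")
    return quotient


experts_per_rank = divide_exact(8, 2, "num_experts")
```

Carrying the `name` into the error message makes a sharding mismatch easier to trace back to the offending dimension.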

bridge.peft.utils._apply_grouped_expert_swiglu_sharded_factory( original_sh_ten: megatron.core.dist_checkpointing.mapping.ShardedTensor, sharded_offsets: Tuple, singleton_local_shards: bool = False, ) → megatron.core.dist_checkpointing.mapping.ShardedTensorFactory#: Split grouped-expert SwiGLU tensors along the fused hidden axis for checkpointing.

bridge.peft.utils._append_rank_offset( rank_offsets: List[Tuple[int, int, int]], axis: int, rank: int, axis_fragments: int, ) → None#: Append a sharding offset, combining fragmentations when the axis is already sharded.

bridge.peft.utils._make_grouped_expert_sharded_tensor( tensor: torch.Tensor, key: str, *, tp_axis: Optional[int], sharded_offsets: Tuple, pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None, ep_size_fallback: int = 1, etp_size_fallback: int = 1, ) → megatron.core.dist_checkpointing.mapping.ShardedTensor#

Build a sharded tensor for packed grouped-expert weights.

Grouped-expert LoRA weights shard two independent local axes: the packed expert axis across EP and the adapter matrix axis across expert TP.

class bridge.peft.utils._GroupedExpertAdapterWeight(weight: torch.Tensor)#

Bases: torch.nn.Module

Callable parameter container so DDP forward pre-hooks see grouped LoRA weights.

Initialization

forward( indices: Optional[List[int]] = None, ) → torch.Tensor#

class bridge.peft.utils.GroupedExpertLinearAdapter( in_features: int, out_features: int, dim: int, *, num_local_experts: int, base_linear_name: str, activation: str = 'swish', column_init_method: str = 'xavier', row_init_method: str = 'zero', input_is_parallel: bool = False, dropout: float = 0.0, model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None, alpha: Optional[float] = None, dropout_position: str = 'pre', base_linear_is_parallel: bool = True, params_device: Optional[torch.device] = None, params_dtype: Optional[torch.dtype] = None, pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None, )#

Bases: torch.nn.Module

LoRA adapter with one low-rank pair per local grouped MoE expert.

Initialization

Initialize grouped-expert LoRA weights for one adapter per local expert.

_extract_expert_splits( args: Tuple, kwargs: Dict, ) → List[int]#: Extract grouped-expert token splits from wrapped-module call arguments.

_gather_along_last_dim(tensor: torch.Tensor) → torch.Tensor#: Gather a tensor across expert TP ranks by concatenating its last dimension.

_can_use_grouped_mm(x: torch.Tensor) → bool#: Return whether the grouped GEMM fast path is supported for this input.

_is_te_fp8_enabled() → bool#: Return whether Transformer Engine’s FP8 autocast context is active.

_can_use_te_grouped_linear_fp8(x: torch.Tensor) → bool#: Return whether TE can run the grouped adapter weights through FP8.

_get_te_grouped_linear_helper( *, projection: str, num_gemms: int, in_features: int, out_features: int, params_dtype: torch.dtype, device: torch.device, ) → torch.nn.Module#: Create or reuse a meta-device TE helper that owns FP8 runtime state.

_forward_te_grouped_linear_fp8( x: torch.Tensor, *, weight: torch.Tensor, m_splits: List[int], projection: str, active_expert_indices: Tuple[int, ...], ) → torch.Tensor#: Apply externally owned grouped adapter weights through TE’s FP8 kernel.

_forward_te_grouped_linear_fp8_on_current_device( x: torch.Tensor, *, weight: torch.Tensor, m_splits: List[int], projection: str, active_expert_indices: Tuple[int, ...], ) → torch.Tensor#: Run TE FP8 grouped linear after selecting the input tensor’s CUDA device.

_build_grouped_mm_offsets( m_splits: List[int], *, device: torch.device, ) → torch.Tensor#: Build inclusive grouped_mm offsets from per-expert split sizes.
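
Inclusive offsets are the running sum of the per-expert split sizes, so entry i marks where expert i's tokens end. The sketch below shows that arithmetic with a plain list; `build_inclusive_offsets` is a hypothetical name, and the real method returns a tensor on the requested device rather than a list.

```python
from itertools import accumulate
from typing import List


def build_inclusive_offsets(m_splits: List[int]) -> List[int]:
    """Return cumulative end positions, one per expert."""
    return list(accumulate(m_splits))


# Three experts receiving 3, 0, and 5 tokens respectively.
offsets = build_inclusive_offsets([3, 0, 5])
```

An expert that receives no tokens simply repeats the previous offset, which marks an empty slice for that expert.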

_forward_grouped_projection( x: torch.Tensor, *, weight: torch.Tensor, m_splits: List[int], use_te_fp8: bool, projection: str, active_expert_indices: Tuple[int, ...], offs: Optional[torch.Tensor] = None, ) → torch.Tensor#: Apply one grouped expert projection through the selected grouped backend.

_forward_per_expert( x: torch.Tensor, *, expert_splits: List[int], expert_tp_size: int, ) → torch.Tensor#: Apply the adapter using the per-expert fallback path.

forward(x: torch.Tensor, *args, **kwargs) → torch.Tensor#: Apply the local expert-specific LoRA update to grouped expert inputs.

sharded_state_dict( prefix: str = '', sharded_offsets: Tuple = (), metadata: Optional[Dict] = None, ) → megatron.core.dist_checkpointing.mapping.ShardedStateDict#: Create sharded state dictionary for grouped-expert adapter weights.

bridge.peft.utils._make_cross_ep_replicated(weight: torch.nn.Parameter) → None#

Mark a weight as logically replicated across the intra-PP-stage group.

Megatron’s DDP routes is_expert=True parameters through the expert data-parallel group only, which does not span the EP axis. A weight that must stay bit-identical across all EP ranks (e.g., the shared side of SharedOuterGroupedExpertAdapter, which a serving engine consumes as a single global LoRA tensor) is otherwise left unsynced. This helper closes that gap with two primitives:

a one-shot broadcast from group rank 0 so every rank starts with bit-identical values despite per-rank RNG forks;
a backward hook that SUM all-reduces the gradient across the group so the optimizer step on every rank applies the same update.

SUM is the correct reduction: each rank’s local gradient is the partial loss gradient over its (token, expert) subset, and the total gradient is the sum of those partials. AVG would train at 1/N the intended rate.

The intra-PP-stage group is tensor_and_data_parallel_group with context parallel included, which by Megatron’s construction equals ETP × EP × EDP — all ranks within the current pipeline stage.

Parameters:: weight – The parameter to keep replicated across the group. Must be a leaf parameter so the backward hook fires when its gradient is computed.
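
The SUM-versus-AVG argument above can be checked with a small worked example. The sketch uses a toy scalar loss, L(w) = sum of w * x_i over all tokens, so the gradient with respect to w is the sum of the x_i; the token values and the two-rank split are made up for illustration.

```python
from typing import List

# Toy data: each rank sees a disjoint subset of the tokens.
tokens_per_rank: List[List[int]] = [[1, 2, 3], [4, 5, 6]]
all_tokens = [x for rank_tokens in tokens_per_rank for x in rank_tokens]

# dL/dw for L(w) = sum_i w * x_i is simply sum_i x_i.
full_grad = sum(all_tokens)

# Each rank only computes the partial gradient over its own tokens.
partial_grads = [sum(rank_tokens) for rank_tokens in tokens_per_rank]

summed = sum(partial_grads)                          # SUM all-reduce result
averaged = sum(partial_grads) / len(partial_grads)   # AVG all-reduce result
```

Here `summed` reproduces the full gradient, while `averaged` is smaller by a factor equal to the number of ranks, which is the 1/N slowdown the docstring warns about.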

class bridge.peft.utils.PackedPerExpertLinear( num_local_experts: int, in_features: int, out_features: int, *, init_method: Optional[Callable] = None, dtype: Optional[torch.dtype] = None, device: Optional[torch.device] = None, )#

Bases: torch.nn.Module

Per-expert linear with a packed 3D weight [N_local, out, in].

Used as the per-expert side of SharedOuterGroupedExpertAdapter. Stores one nn.Parameter (3D) so Bridge’s adapter export sees a single .weight per side, matching the linear_in.weight / linear_out.weight convention in megatron.bridge.models.conversion.peft_bridge. Forward dispatches to torch._grouped_mm (the same grouped GEMM kernel that TE’s te.pytorch.GroupedLinear calls) via a single fused op with native autograd, which keeps rank kernel launch counts in lockstep so CP’s ring P2P does not deadlock.

Initialization

forward(x: torch.Tensor, m_splits) → Tuple[torch.Tensor, None]#

sharded_state_dict( prefix: str = '', sharded_offsets: Tuple = (), metadata: Optional[Dict] = None, ) → megatron.core.dist_checkpointing.mapping.ShardedStateDict#: Shard the packed 3D weight along dim 0 (experts) across EP ranks.

class bridge.peft.utils.SharedOuterGroupedExpertAdapter( in_features: int, out_features: int, dim: int, *, num_local_experts: int, base_linear_name: str, activation: str = 'swish', column_init_method: str = 'xavier', row_init_method: str = 'zero', input_is_parallel: bool = False, dropout: float = 0.0, model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None, alpha: Optional[float] = None, dropout_position: str = 'pre', base_linear_is_parallel: bool = True, params_device: Optional[torch.device] = None, params_dtype: Optional[torch.dtype] = None, )#

Bases: torch.nn.Module

LoRA adapter for grouped expert MLP with shared-outer semantics.

Matches SGLang PR #21466’s experts_shared_outer_loras=True contract:

fc1 (gate_up): linear_in = SHARED (hidden -> rank); linear_out = PER-EXPERT (rank -> 2*intermediate)
fc2 (down): linear_in = PER-EXPERT (intermediate -> rank); linear_out = SHARED (rank -> hidden)

The shared side is an is_expert=True ColumnParallelLinear (fc1) or RowParallelLinear (fc2): the TP group is ETP (ETP=1 → local forward), DDP routes the weight through the EDP group, and the logically-replicated cross-EP axis is covered by _make_cross_ep_replicated.

The per-expert side is PackedPerExpertLinear (packed 3D weight, dispatched through torch._grouped_mm), kept as a single .weight Parameter so Bridge’s adapter-export materializer (which reads linear_in.weight / linear_out.weight) sees a standard single-weight linear per side.

Differs from ParallelLinearAdapter in __init__ and forward; sharded_state_dict is specialized for the packed 3D per-expert side.
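
The shared-outer contract can be summarized as a table of local weight shapes. The sketch below is bookkeeping only and rests on assumptions: `shared_outer_weight_shapes` is a hypothetical helper, expert TP is taken to be 1 so nothing is sharded, the shared side is assumed to use the usual [out, in] layout, and the per-expert side uses the packed [N_local, out, in] layout described for PackedPerExpertLinear.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def shared_outer_weight_shapes(
    hidden: int, intermediate: int, rank: int, num_local_experts: int
) -> Dict[str, Dict[str, Shape]]:
    """Return local weight shapes for the fc1 and fc2 adapters (ETP assumed 1)."""
    return {
        "fc1": {
            "linear_in": (rank, hidden),                                # shared
            "linear_out": (num_local_experts, 2 * intermediate, rank),  # per-expert
        },
        "fc2": {
            "linear_in": (num_local_experts, rank, intermediate),       # per-expert
            "linear_out": (hidden, rank),                               # shared
        },
    }


shapes = shared_outer_weight_shapes(hidden=1024, intermediate=4096, rank=16, num_local_experts=8)
```

In both projections the shared weight sits on the side that touches the model's hidden dimension, which is what "shared-outer" refers to.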

Initialization

Initialize shared-outer LoRA weights with one shared and one per-expert side.

forward(x: torch.Tensor, m_splits=None) → torch.Tensor#: Forward. m_splits is the tokens-per-expert split passed through from the base TEGroupedLinear; required for the per-expert side.

sharded_state_dict( prefix: str = '', sharded_offsets: Tuple = (), metadata: Optional[Dict] = None, ) → megatron.core.dist_checkpointing.mapping.ShardedStateDict#: Create sharded state dictionary for mixed shared/per-expert adapter weights.
