bridge.peft.utils#

Module Contents#

Classes#

AdapterAttributes

Container for base linear adapter attributes.

_All2AllHp2Sp

All-to-All from Hidden Parallel to Sequence Parallel.

ParallelLinearAdapter

Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.

GroupedExpertLinearAdapter

LoRA adapter with one low-rank pair per local grouped MoE expert.

Functions#

create_peft

Create a Bridge PEFT object from a small config mapping.

create_peft_hook

Create a provider pre-wrap hook that applies PEFT.

load_peft_adapter_checkpoint

Load a PEFT adapter checkpoint into an already transformed model.

_apply_peft

Apply PEFT and mark adapter parameters for checkpointing.

_import_peft_class

_model_state_dict

Generate Bridge model checkpoint sections for an external trainer.

_ensure_model_list

is_modelopt_linear

Return whether a module is ModelOpt's local Megatron Linear.

get_adapter_attributes_from_linear

Returns attributes from the base layer as an AdapterAttributes dataclass.

is_expert_linear

Return whether the current base module is an expert linear module.

is_grouped_expert_linear

Return whether the current base module is a grouped expert linear module.

get_effective_lora_dim

Return the LoRA rank to use, reduced for expert layers when normalize_moe_lora is enabled.

align_expert_dim_for_tp

Round normalized expert LoRA ranks up to the expert-TP granularity when needed.

wildcard_match

Return whether the pattern (target module to add LoRA) matches the key (model weight name).

init_method_normal

Create an initialization method based on normal distribution N(0, sigma).

init_method_kaiming_uniform

Create an initialization method based on Kaiming uniform distribution.

init_method_const

Create an initialization method that sets all values to a constant.

pad_seq_to_mult

Pad sequence length to be a multiple of mult.

unpad_seq_to_mult

Remove sequence padding that was added by pad_seq_to_mult.

all2all_hp2sp

Perform All-to-All communication from Hidden Parallel to Sequence Parallel.

_divide_exact

Divide value by divisor and raise when the result would be fractional.

_apply_grouped_expert_swiglu_sharded_factory

Split grouped-expert SwiGLU tensors along the fused hidden axis for checkpointing.

_append_rank_offset

Append a sharding offset, combining fragmentations when the axis is already sharded.

_make_grouped_expert_sharded_tensor

Build a sharded tensor for packed grouped-expert weights.

Data#

API#

bridge.peft.utils.logger#

'getLogger(…)'

bridge.peft.utils.ModelList#

None

bridge.peft.utils.ModelHook#

None

bridge.peft.utils.CheckpointPath#

None

bridge.peft.utils.HAVE_TE#

'all(…)'

bridge.peft.utils.TECL#

()

bridge.peft.utils.TERL#

()

bridge.peft.utils.create_peft(
config: Mapping[str, Any],
dtype: torch.dtype | str | int | None = None,
) object | None#

Create a Bridge PEFT object from a small config mapping.

bridge.peft.utils.create_peft_hook(
peft: object,
training: bool = True,
) bridge.peft.utils.ModelHook#

Create a provider pre-wrap hook that applies PEFT.

bridge.peft.utils.load_peft_adapter_checkpoint(
model: bridge.peft.utils.ModelList | megatron.core.transformer.module.MegatronModule,
adapter_checkpoint_path: bridge.peft.utils.CheckpointPath,
peft: object,
strict: bool = False,
model_sd_kwargs: Mapping[str, object] | None = None,
ckpt_format: str = 'torch_dist',
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None,
fully_parallel_load: bool = True,
load_strategy: object | None = None,
) None#

Load a PEFT adapter checkpoint into an already transformed model.

bridge.peft.utils._apply_peft(
peft: object,
model: bridge.peft.utils.ModelList,
training: bool = True,
) bridge.peft.utils.ModelList#

Apply PEFT and mark adapter parameters for checkpointing.

bridge.peft.utils._import_peft_class(peft_type: str) type[Any]#

bridge.peft.utils._model_state_dict(
model: bridge.peft.utils.ModelList,
model_sd_kwargs: Mapping[str, object] | None = None,
ckpt_format: str = 'torch_dist',
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None,
) dict[str, Any]#

Generate Bridge model checkpoint sections for an external trainer.

bridge.peft.utils._ensure_model_list(
model: bridge.peft.utils.ModelList | megatron.core.transformer.module.MegatronModule,
) bridge.peft.utils.ModelList#

bridge.peft.utils.is_modelopt_linear(m: torch.nn.Module) bool#

Return whether a module is ModelOpt's local Megatron Linear.

class bridge.peft.utils.AdapterAttributes#

Container for base linear adapter attributes.

input_is_parallel: bool#

None

in_features: int#

None

out_features: int#

None

disable_tensor_parallel_comm: bool#

None

disable_sequence_parallel_comm: bool#

None

base_linear_is_parallel: bool#

None

bridge.peft.utils.get_adapter_attributes_from_linear(
m: torch.nn.Module,
is_expert: bool = False,
) bridge.peft.utils.AdapterAttributes#

Returns attributes from the base layer as an AdapterAttributes dataclass.

The dataclass fields are input_is_parallel, in_features, out_features, disable_tensor_parallel_comm, disable_sequence_parallel_comm, and base_linear_is_parallel.

This function analyzes a linear module and extracts key attributes needed for adapter configuration, particularly for PEFT adapters in distributed training scenarios.

Parameters:

m – The linear module to analyze (should have a config attribute).

Returns:

AdapterAttributes containing:

  • input_is_parallel: Whether the input is already parallelized

  • in_features: Input feature dimension

  • out_features: Output feature dimension

  • disable_tensor_parallel_comm: Whether to disable tensor parallel communication

  • disable_sequence_parallel_comm: Whether to disable sequence parallel communication

  • base_linear_is_parallel: Whether the base linear layer uses parallelization

Raises:

NotImplementedError – If the layer type is not recognized for LoRA adaptation.

bridge.peft.utils.is_expert_linear(fqn: str) bool#

Return whether the current base module is an expert linear module.

This function checks if a fully qualified name (FQN) corresponds to an expert linear module in a Mixture of Experts (MoE) architecture.

Parameters:

fqn – Fully qualified name of the module.

Returns:

True if the module is an expert linear module, False otherwise.

Example:

>>> is_expert_linear("model.layers.0.mlp.experts.0.linear_fc1")
True
>>> is_expert_linear("model.layers.0.mlp.linear_fc1")
False
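The check above can be approximated with a regular expression over the module name. The sketch below is only an illustration: the pattern and the name `is_expert_linear_sketch` are assumptions, not the library's actual matching rule, which may cover additional naming schemes.

```python
import re

# Hypothetical pattern (an assumption, not the library's rule): a name that
# contains an "mlp.experts." segment and ends in linear_fc1 or linear_fc2.
_EXPERT_LINEAR_RE = re.compile(r".*mlp\.experts\..*linear_fc[12]$")


def is_expert_linear_sketch(fqn: str) -> bool:
    """Return True when the fully qualified name looks like an MoE expert linear."""
    return _EXPERT_LINEAR_RE.match(fqn) is not None


# The two names from the example above.
expert_name = "model.layers.0.mlp.experts.0.linear_fc1"
dense_name = "model.layers.0.mlp.linear_fc1"
print(is_expert_linear_sketch(expert_name), is_expert_linear_sketch(dense_name))
```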

bridge.peft.utils.is_grouped_expert_linear(fqn: str) bool#

Return whether the current base module is a grouped expert linear module.

bridge.peft.utils.get_effective_lora_dim(
module: torch.nn.Module,
*,
dim: int,
normalize_moe_lora: bool,
is_expert: bool,
) int#

Return the LoRA rank to use, reduced for expert layers when normalize_moe_lora is enabled.

bridge.peft.utils.align_expert_dim_for_tp(
module: torch.nn.Module,
dim: int,
*,
normalize_moe_lora: bool,
is_expert: bool,
input_is_parallel: bool,
) int#

Round normalized expert LoRA ranks up to the expert-TP granularity when needed.
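The rounding step can be illustrated with plain integer arithmetic. This sketch assumes that the "expert-TP granularity" is the expert tensor-parallel world size. The helper name `round_up_to_multiple` is hypothetical, and the real function additionally inspects the module and the normalize_moe_lora, is_expert, and input_is_parallel flags before deciding whether to round at all.

```python
def round_up_to_multiple(dim: int, multiple: int) -> int:
    """Round dim up to the nearest multiple of `multiple` (hypothetical helper)."""
    if multiple <= 0:
        raise ValueError(f"multiple must be positive, got {multiple}")
    # Integer ceiling division, then scale back up.
    return ((dim + multiple - 1) // multiple) * multiple


# Assumed scenario: a normalized expert rank of 6 with an expert-TP size of 4
# is rounded up so that the rank splits evenly across the 4 ranks.
aligned = round_up_to_multiple(6, 4)
unchanged = round_up_to_multiple(8, 4)
```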

bridge.peft.utils.wildcard_match(
pattern: str,
key: Optional[str],
) Optional[bool]#

Return whether the pattern (target module to add LoRA) matches the key (model weight name).

This function performs wildcard matching using '*' as a placeholder for any substring.

Parameters:
  • pattern – Pattern string with wildcards (*) to match against.

  • key – Key string to test against the pattern.

Returns:

True if the pattern matches the key, False if it doesn't, None if key is None.

Example:

>>> wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.0.self_attention.linear_qkv")
True
>>> wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.1.self_attention.linear_qkv")
False
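One way to get this behaviour is to translate the pattern into a regular expression. The following is a minimal sketch under that assumption. `wildcard_match_sketch` is a hypothetical stand-in, and the library's own implementation may differ in how it treats regex metacharacters in the pattern.

```python
import re
from typing import Optional


def wildcard_match_sketch(pattern: str, key: Optional[str]) -> Optional[bool]:
    """Match `key` against `pattern`, where '*' stands for any substring."""
    if key is None:
        return None
    # Escape the literal pieces so '.' is not treated as a regex wildcard,
    # then join them with '.*' so each '*' matches any (possibly empty) run.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, key) is not None


pattern = "*.layers.0.*.linear_qkv"
layer0 = wildcard_match_sketch(pattern, "decoder.layers.0.self_attention.linear_qkv")
layer1 = wildcard_match_sketch(pattern, "decoder.layers.1.self_attention.linear_qkv")
```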

bridge.peft.utils.init_method_normal(
sigma: float,
) Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method based on normal distribution N(0, sigma).

Parameters:

sigma – Standard deviation for the normal distribution.

Returns:

Initialization function that applies normal distribution to a tensor.

bridge.peft.utils.init_method_kaiming_uniform(
val: float,
) Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method based on Kaiming uniform distribution.

Parameters:

val – The 'a' parameter for Kaiming uniform initialization.

Returns:

Initialization function that applies Kaiming uniform distribution to a tensor.

bridge.peft.utils.init_method_const(
val: float,
) Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method that sets all values to a constant.

Parameters:

val – Constant value to initialize the tensor with.

Returns:

Initialization function that sets tensor to constant value.

bridge.peft.utils.pad_seq_to_mult(
x: torch.Tensor,
mult: int,
) Tuple[torch.Tensor, int]#

Pad sequence length to be a multiple of mult.

This function pads the first dimension of the tensor so that its length is divisible by mult. It is used primarily for MoE (Mixture of Experts) operations that require specific sequence lengths.

Parameters:
  • x – Input tensor to pad.

  • mult – Multiple that the sequence length should be divisible by.

Returns:

A tuple containing:

  • Padded tensor

  • Number of padding elements added

bridge.peft.utils.unpad_seq_to_mult(x: torch.Tensor, pad_len: int) torch.Tensor#

Remove sequence padding that was added by pad_seq_to_mult.

Parameters:
  • x – Padded tensor to unpad.

  • pad_len – Number of padding elements to remove from the end.

Returns:

Unpadded tensor with pad_len elements removed from the first dimension.
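The pad and unpad pair forms a round trip. The sketch below shows the length arithmetic on plain Python lists of rows standing in for tensors. The helper names are hypothetical, and filling the padding rows with zeros is an assumption about the padding value.

```python
from typing import List, Tuple

Row = List[float]


def pad_rows_to_mult(rows: List[Row], mult: int) -> Tuple[List[Row], int]:
    """Append zero rows until the number of rows is a multiple of mult."""
    pad_len = (-len(rows)) % mult  # 0 when the length is already a multiple
    width = len(rows[0]) if rows else 0
    padded = rows + [[0.0] * width for _ in range(pad_len)]
    return padded, pad_len


def unpad_rows(rows: List[Row], pad_len: int) -> List[Row]:
    """Drop the last pad_len rows (slicing by length so pad_len == 0 is a no-op)."""
    return rows[: len(rows) - pad_len]


original = [[float(i), float(i)] for i in range(5)]  # 5 rows, 2 columns
padded, pad_len = pad_rows_to_mult(original, mult=4)  # 5 rows become 8 rows
restored = unpad_rows(padded, pad_len)
```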

class bridge.peft.utils._All2AllHp2Sp#

Bases: torch.autograd.Function

All-to-All from Hidden Parallel to Sequence Parallel.

This is a temporary workaround for distributed communication patterns and can be updated in the future. It performs all-to-all communication to transform from hidden parallel to sequence parallel layout.

TODO: Move the functionality to MCore

static forward(ctx, input_: torch.Tensor) torch.Tensor#

Forward pass: All-to-All from Hidden Parallel to Sequence Parallel.

Parameters:
  • ctx – Autograd context (unused but required by Function interface).

  • input_ – Input tensor in hidden parallel layout.

Returns:

Output tensor in sequence parallel layout.

static backward(ctx, grad_output: torch.Tensor) torch.Tensor#

Backward pass: All-to-All from Sequence Parallel to Hidden Parallel.

Parameters:
  • ctx – Autograd context (unused but required by Function interface).

  • grad_output – Gradient tensor in sequence parallel layout.

Returns:

Gradient tensor in hidden parallel layout.

bridge.peft.utils.all2all_hp2sp(input_: torch.Tensor) torch.Tensor#

Perform All-to-All communication from Hidden Parallel to Sequence Parallel.

Parameters:

input_ – Input tensor in hidden parallel layout.

Returns:

Output tensor in sequence parallel layout.

class bridge.peft.utils.ParallelLinearAdapter(
in_features: int,
out_features: int,
dim: int,
base_linear_name: str,
activation: str = 'swish',
column_init_method: str = 'xavier',
row_init_method: str = 'zero',
input_is_parallel: bool = False,
dropout: float = 0.0,
model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None,
alpha: Optional[float] = None,
dropout_position: str = 'pre',
a2a_experimental: bool = False,
is_expert: bool = False,
disable_tensor_parallel_comm: bool = False,
disable_sequence_parallel_comm: bool = True,
base_linear_is_parallel: bool = True,
)#

Bases: torch.nn.Module

Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.

This adapter implements a low-rank adaptation pattern using two linear layers with configurable parallelization strategies. It supports both tensor and sequence parallelism patterns used in large language model training.

The adapter follows the pattern input -> linear_in -> activation -> linear_out -> scaling, where linear_in and linear_out are parallelized according to the base layer configuration.
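The pattern can be sketched on a single device with ordinary PyTorch layers. This is an illustration, not the class itself: it omits all tensor and sequence parallel handling, `LowRankAdapterSketch` is a hypothetical name, and the scaling factor alpha / dim is an assumption inferred from the alpha parameter falling back to dim.

```python
import torch
from torch import nn


class LowRankAdapterSketch(nn.Module):
    """Single-device stand-in: input -> linear_in -> activation -> linear_out -> scaling."""

    def __init__(self, in_features: int, out_features: int, dim: int, alpha=None):
        super().__init__()
        self.linear_in = nn.Linear(in_features, dim, bias=False)
        self.activation = nn.SiLU()  # SiLU is the "swish" activation
        self.linear_out = nn.Linear(dim, out_features, bias=False)
        # Mirror the documented defaults: xavier for the first projection,
        # zeros for the second, so the adapter starts as a no-op.
        nn.init.xavier_normal_(self.linear_in.weight)
        nn.init.zeros_(self.linear_out.weight)
        # Assumed scaling rule: alpha / dim, with alpha defaulting to dim.
        self.scale = (alpha if alpha is not None else dim) / dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_out(self.activation(self.linear_in(x))) * self.scale


adapter = LowRankAdapterSketch(in_features=16, out_features=32, dim=4, alpha=8)
y = adapter(torch.randn(2, 16))  # all zeros at initialization
```

Because the second projection starts at zero, adding the adapter output to the base layer output leaves the model unchanged until training updates the adapter weights.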

Parameters:
  • in_features – Input feature dimension.

  • out_features – Output feature dimension.

  • dim – Adapter bottleneck dimension (rank).

  • base_linear_name – Name of the base linear layer being adapted.

  • activation – Activation function name (default: 'swish').

  • column_init_method – Initialization method for column parallel layer (default: 'xavier').

  • row_init_method – Initialization method for row parallel layer (default: 'zero').

  • input_is_parallel – Whether input is already parallelized (default: False).

  • dropout – Dropout probability (default: 0.0).

  • model_parallel_config – Configuration for model parallelism (default: None).

  • alpha – Scaling factor for adapter output (default: None, uses dim).

  • dropout_position – Where to apply dropout ('pre' or 'post', default: 'pre').

  • a2a_experimental – Whether to use experimental all-to-all communication (default: False).

  • is_expert – Whether this adapter is for expert layers in MoE (default: False).

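  • disable_tensor_parallel_comm – Whether to disable tensor parallel communication (default: False).

  • disable_sequence_parallel_comm – Whether to disable sequence parallel communication (default: True).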

  • base_linear_is_parallel – Whether the base linear layer uses parallelization (default: True).

Initialization

Initialize the ParallelLinearAdapter.

Parameters:
  • in_features – Input feature dimension.

  • out_features – Output feature dimension.

  • dim – Adapter bottleneck dimension.

  • base_linear_name – Name of the base linear layer.

  • activation – Activation function name.

  • column_init_method – Initialization for column parallel layers.

  • row_init_method – Initialization for row parallel layers.

  • input_is_parallel – Whether input is already parallelized.

  • dropout – Dropout probability.

  • model_parallel_config – Model parallelism configuration.

  • alpha – Scaling factor (uses dim if None).

  • dropout_position – When to apply dropout.

  • a2a_experimental – Use experimental all-to-all communication.

  • is_expert – Whether for expert layers in MoE.

  • disable_tensor_parallel_comm – Disable tensor parallel communication.

  • disable_sequence_parallel_comm – Disable sequence parallel communication.

  • dropout_recompute – Use recomputation for dropout.

_get_activation_fn(activation: str) torch.nn.Module#

Get activation function by name.

Parameters:

activation – Name of the activation function.

Returns:

PyTorch activation module.

Note: Defaults to Identity if the activation name is not recognized.

_get_init_fn(
init_method: str,
) Callable[[torch.Tensor], torch.Tensor]#

Get initialization function by method name.

Parameters:

init_method – Name of the initialization method.

Returns:

Initialization function.

Raises:

NotImplementedError – If init_method is not supported.

forward(x: torch.Tensor, *args, **kwargs) torch.Tensor#

Forward pass of the parallel linear adapter.

Performs the adaptation computation with proper handling of parallel communication patterns, dropout, and expert routing for MoE scenarios.

Parameters:

x – Input tensor.

Returns:

Adapted output tensor with scaling applied.

sharded_state_dict(
prefix: str = '',
sharded_offsets: Tuple = (),
metadata: Optional[Dict] = None,
mamba_dim_info: Optional[Dict] = None,
) megatron.core.dist_checkpointing.mapping.ShardedStateDict#

Create sharded state dictionary for distributed checkpointing.

Special treatment is given to the linear_fc1 adapter since tensor parallelism is sharded separately for the two logical matrices (gate and up) in SwiGLU.

Parameters:
  • prefix – Prefix for parameter names.

  • sharded_offsets – Offsets for sharded parameters.

  • metadata – Additional metadata for sharding.

Returns:

Sharded state dictionary for distributed checkpointing.

bridge.peft.utils._divide_exact(value: int, divisor: int, name: str) int#

Divide value by divisor and raise when the result would be fractional.
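A minimal sketch of such a guarded division follows. The exception type (ValueError) and the message wording are assumptions, since the reference above states only that the function raises. `divide_exact_sketch` is a hypothetical name.

```python
def divide_exact_sketch(value: int, divisor: int, name: str) -> int:
    """Return value // divisor, refusing to silently drop a remainder."""
    if divisor <= 0:
        raise ValueError(f"divisor for {name} must be positive, got {divisor}")
    quotient, remainder = divmod(value, divisor)
    if remainder != 0:
        # `name` identifies the offending quantity in the error message.
        raise ValueError(f"{name}={value} is not divisible by {divisor}")
    return quotient


rows_per_rank = divide_exact_sketch(12, 4, "hidden_size")
```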

bridge.peft.utils._apply_grouped_expert_swiglu_sharded_factory(
original_sh_ten: megatron.core.dist_checkpointing.mapping.ShardedTensor,
sharded_offsets: Tuple,
singleton_local_shards: bool = False,
) megatron.core.dist_checkpointing.mapping.ShardedTensorFactory#

Split grouped-expert SwiGLU tensors along the fused hidden axis for checkpointing.

bridge.peft.utils._append_rank_offset(
rank_offsets: List[Tuple[int, int, int]],
axis: int,
rank: int,
axis_fragments: int,
) None#

Append a sharding offset, combining fragmentations when the axis is already sharded.

bridge.peft.utils._make_grouped_expert_sharded_tensor(
tensor: torch.Tensor,
key: str,
*,
tp_axis: Optional[int],
sharded_offsets: Tuple,
) megatron.core.dist_checkpointing.mapping.ShardedTensor#

Build a sharded tensor for packed grouped-expert weights.

Grouped-expert LoRA weights shard two independent local axes: the packed expert axis across EP and the adapter matrix axis across expert TP.

class bridge.peft.utils.GroupedExpertLinearAdapter(
in_features: int,
out_features: int,
dim: int,
*,
num_local_experts: int,
base_linear_name: str,
activation: str = 'swish',
column_init_method: str = 'xavier',
row_init_method: str = 'zero',
input_is_parallel: bool = False,
dropout: float = 0.0,
model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None,
alpha: Optional[float] = None,
dropout_position: str = 'pre',
base_linear_is_parallel: bool = True,
params_device: Optional[torch.device] = None,
params_dtype: Optional[torch.dtype] = None,
)#

Bases: torch.nn.Module

LoRA adapter with one low-rank pair per local grouped MoE expert.

Initialization

Initialize grouped-expert LoRA weights for one adapter per local expert.

_extract_expert_splits(
args: Tuple,
kwargs: Dict,
) List[int]#

Extract grouped-expert token splits from wrapped-module call arguments.

_gather_along_last_dim(tensor: torch.Tensor) torch.Tensor#

Gather a tensor across expert TP ranks by concatenating its last dimension.

_can_use_grouped_mm(x: torch.Tensor) bool#

Return whether the grouped GEMM fast path is supported for this input.

_is_te_grouped_mlp_call(
args: Tuple,
kwargs: Dict,
) bool#

Return whether the wrapped base layer is being invoked from TEGroupedMLP.

TEGroupedMLP forwards tokens_per_expert positionally into grouped linears after converting it to a Python list, while grouped-GEMM callers use m_splits.

_can_use_te_grouped_linear(x: torch.Tensor) bool#

Return whether the TEGroupedMLP fast path is supported for this input.

_get_te_grouped_linear_helper(
*,
num_gemms: int,
in_features: int,
out_features: int,
params_dtype: torch.dtype,
) torch.nn.Module#

Create or reuse a lightweight TE GroupedLinear helper for the requested shape.

_forward_te_grouped_linear(
x: torch.Tensor,
*,
weight: torch.Tensor,
m_splits: List[int],
) torch.Tensor#

Apply a grouped expert projection with TE's grouped-linear autograd kernel.

_build_grouped_mm_offsets(
m_splits: List[int],
*,
device: torch.device,
) torch.Tensor#

Build inclusive grouped_mm offsets from per-expert split sizes.
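Inclusive offsets are the running totals of the per-expert split sizes, so entry i marks the end of expert i's rows in the packed token buffer. The sketch below uses plain Python lists for illustration. The documented method returns a torch.Tensor on the given device, and the helper name here is hypothetical.

```python
from itertools import accumulate
from typing import List


def build_inclusive_offsets(m_splits: List[int]) -> List[int]:
    """Cumulative sum of split sizes: offsets[i] is the end row of expert i."""
    return list(accumulate(m_splits))


m_splits = [3, 0, 5, 2]  # tokens routed to each of 4 local experts
offsets = build_inclusive_offsets(m_splits)

# Expert i owns rows [start, end), where start is the previous offset (or 0).
row_ranges = [(end - size, end) for size, end in zip(m_splits, offsets)]
```

An expert that receives no tokens, such as the second one here, contributes an empty range and repeats the previous offset.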

_forward_grouped_projection(
x: torch.Tensor,
*,
weight: torch.Tensor,
m_splits: List[int],
use_te_grouped_linear: bool,
offs: Optional[torch.Tensor] = None,
) torch.Tensor#

Apply one grouped expert projection through the selected fast-path backend.

_forward_per_expert(
x: torch.Tensor,
*,
expert_splits: List[int],
expert_tp_size: int,
) torch.Tensor#

Apply the adapter using the per-expert fallback path.

forward(x: torch.Tensor, *args, **kwargs) torch.Tensor#

Apply the local expert-specific LoRA update to grouped expert inputs.

sharded_state_dict(
prefix: str = '',
sharded_offsets: Tuple = (),
metadata: Optional[Dict] = None,
) megatron.core.dist_checkpointing.mapping.ShardedStateDict#

Create sharded state dictionary for grouped-expert adapter weights.