bridge.peft.utils#

Module Contents#

Classes#

_All2AllHp2Sp

All-2-All from Hidden Parallel to Sequence Parallel.

ParallelLinearAdapter

Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.

Functions#

get_adapter_attributes_from_linear

Returns attributes from the base layer.

is_expert_linear

Return whether the current base module is an expert linear module.

wildcard_match

Return whether the pattern (target module to add LoRA) matches the key (model weight name).

init_method_normal

Create an initialization method based on normal distribution N(0, sigma).

init_method_kaiming_uniform

Create an initialization method based on Kaiming uniform distribution.

init_method_const

Create an initialization method that sets all values to a constant.

pad_seq_to_mult

Pad sequence length to be a multiple of mult.

unpad_seq_to_mult

Remove sequence padding that was added by pad_seq_to_mult.

all2all_hp2sp

Perform All-to-All communication from Hidden Parallel to Sequence Parallel.

Data#

API#

bridge.peft.utils.HAVE_TE#

'all(…)'

bridge.peft.utils.TECL#

()

bridge.peft.utils.TERL#

()

bridge.peft.utils.get_adapter_attributes_from_linear(
m: torch.nn.Module,
) → Tuple[bool, int, int, bool, bool]#

Returns attributes from the base layer.

The returned tuple is (input_is_parallel, in_features, out_features, disable_sequence_parallel_comm, base_linear_is_parallel).

This function analyzes a linear module and extracts key attributes needed for adapter configuration, particularly for PEFT adapters in distributed training scenarios.

Parameters:

m – The linear module to analyze (should have a config attribute).

Returns:

  • input_is_parallel: Whether the input is already parallelized

  • in_features: Input feature dimension

  • out_features: Output feature dimension

  • disable_sequence_parallel_comm: Whether to disable sequence parallel communication

  • base_linear_is_parallel: Whether the base linear layer uses parallelization

Return type:

A tuple containing the values listed above.

Raises:

NotImplementedError – If the layer type is not recognized for LoRA adaptation.

bridge.peft.utils.is_expert_linear(fqn: str) → bool#

Return whether the current base module is an expert linear module.

This function checks if a fully qualified name (FQN) corresponds to an expert linear module in a Mixture of Experts (MoE) architecture.

Parameters:

fqn – Fully qualified name of the module.

Returns:

True if the module is an expert linear module, False otherwise.

Example

>>> is_expert_linear("model.layers.0.mlp.experts.0.linear_fc1")
True
>>> is_expert_linear("model.layers.0.mlp.linear_fc1")
False

bridge.peft.utils.wildcard_match(
pattern: str,
key: Optional[str],
) → Optional[bool]#

Return whether the pattern (target module to add LoRA) matches the key (model weight name).

This function performs wildcard matching using '*' as a placeholder for any substring.

Parameters:
  • pattern – Pattern string with wildcards (*) to match against.

  • key – Key string to test against the pattern.

Returns:

True if the pattern matches the key, False if it doesn’t, None if key is None.

Example

>>> wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.0.self_attention.linear_qkv")
True
>>> wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.1.self_attention.linear_qkv")
False
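Beyond the doctest above, a short usage sketch shows the typical filtering pattern (the weight names below are hypothetical, and the import assumes this module is importable as bridge.peft.utils):

from bridge.peft.utils import wildcard_match

target_pattern = "*.layers.*.self_attention.linear_qkv"   # hypothetical LoRA target pattern
weight_names = [
    "decoder.layers.0.self_attention.linear_qkv",
    "decoder.layers.0.mlp.linear_fc1",
]
# Keep only the weights whose names match the target pattern.
matched = [name for name in weight_names if wildcard_match(target_pattern, name)]
# matched == ["decoder.layers.0.self_attention.linear_qkv"]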

bridge.peft.utils.init_method_normal(
sigma: float,
) → Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method based on normal distribution N(0, sigma).

Parameters:

sigma – Standard deviation for the normal distribution.

Returns:

Initialization function that applies normal distribution to a tensor.

bridge.peft.utils.init_method_kaiming_uniform(
val: float,
) → Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method based on Kaiming uniform distribution.

Parameters:

val – The 'a' parameter for Kaiming uniform initialization.

Returns:

Initialization function that applies Kaiming uniform distribution to a tensor.

bridge.peft.utils.init_method_const(
val: float,
) → Callable[[torch.Tensor], torch.Tensor]#

Create an initialization method that sets all values to a constant.

Parameters:

val – Constant value to initialize the tensor with.

Returns:

Initialization function that sets tensor to constant value.
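A minimal usage sketch for the three initialization helpers above (the import assumes this module is importable as bridge.peft.utils; each helper returns a callable that is applied to a tensor, per the descriptions above):

import torch
from bridge.peft.utils import init_method_normal, init_method_kaiming_uniform, init_method_const

weight = torch.empty(8, 4)
init_method_normal(0.02)(weight)               # samples from N(0, 0.02)
init_method_kaiming_uniform(5 ** 0.5)(weight)  # Kaiming uniform with a = sqrt(5)
init_method_const(0.0)(weight)                 # constant fill, e.g. zero-init for an adapter's output projection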

bridge.peft.utils.pad_seq_to_mult(
x: torch.Tensor,
mult: int,
) → Tuple[torch.Tensor, int]#

Pad sequence length to be a multiple of mult.

This function pads the first dimension of the tensor to ensure it’s divisible by mult. Used primarily for MoE (Mixture of Experts) operations that require specific sequence lengths.

Parameters:
  • x – Input tensor to pad.

  • mult – Multiple that the sequence length should be divisible by.

Returns:

  • Padded tensor

  • Number of padding elements added

Return type:

A tuple containing the values listed above.

bridge.peft.utils.unpad_seq_to_mult(x: torch.Tensor, pad_len: int) → torch.Tensor#

Remove sequence padding that was added by pad_seq_to_mult.

Parameters:
  • x – Padded tensor to unpad.

  • pad_len – Number of padding elements to remove from the end.

Returns:

Unpadded tensor with pad_len elements removed from the first dimension.
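A small round-trip sketch of the two padding helpers (the import assumes this module is importable as bridge.peft.utils; the expected shapes follow from the documented behavior):

import torch
from bridge.peft.utils import pad_seq_to_mult, unpad_seq_to_mult

x = torch.randn(10, 2, 64)                  # sequence length 10 in the first dimension
padded, pad_len = pad_seq_to_mult(x, mult=4)
# padded.shape[0] == 12 and pad_len == 2: 10 is rounded up to the next multiple of 4
restored = unpad_seq_to_mult(padded, pad_len)
# restored recovers the original shape (10, 2, 64)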

class bridge.peft.utils._All2AllHp2Sp#

Bases: torch.autograd.Function

All-2-All from Hidden Parallel to Sequence Parallel.

This is a temporary workaround for distributed communication patterns and can be updated in the future. It performs all-to-all communication to transform from hidden parallel to sequence parallel layout.

TODO: Move the functionality to MCore

static forward(ctx, input_: torch.Tensor) → torch.Tensor#

Forward pass: All-to-All from Hidden Parallel to Sequence Parallel.

Parameters:
  • ctx – Autograd context (unused but required by Function interface).

  • input_ – Input tensor in hidden parallel layout.

Returns:

Output tensor in sequence parallel layout.

static backward(ctx, grad_output: torch.Tensor) → torch.Tensor#

Backward pass: All-to-All from Sequence Parallel to Hidden Parallel.

Parameters:
  • ctx – Autograd context (unused but required by Function interface).

  • grad_output – Gradient tensor in sequence parallel layout.

Returns:

Gradient tensor in hidden parallel layout.

bridge.peft.utils.all2all_hp2sp(input_: torch.Tensor) → torch.Tensor#

Perform All-to-All communication from Hidden Parallel to Sequence Parallel.

Parameters:

input_ – Input tensor in hidden parallel layout.

Returns:

Output tensor in sequence parallel layout.
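To illustrate the layout change, here is a simplified, standalone sketch of a hidden-parallel to sequence-parallel all-to-all (a hypothetical helper, not the implementation above; it assumes torch.distributed is initialized and that the sequence length divides evenly across the tensor-parallel group):

import torch
import torch.distributed as dist

def all2all_hp2sp_sketch(x: torch.Tensor, group=None) -> torch.Tensor:
    # Input is sharded along the hidden (last) dimension: [s, b, h/tp] on every rank.
    tp = dist.get_world_size(group)
    # Split the full sequence into tp chunks, each of shape [s/tp, b, h/tp].
    send = list(torch.chunk(x.contiguous(), tp, dim=0))
    recv = [torch.empty_like(c) for c in send]
    # Exchange chunks so each rank collects its sequence chunk from every rank's hidden shard.
    dist.all_to_all(recv, send, group=group)
    # Reassemble the full hidden dimension: output is sharded along the sequence, [s/tp, b, h].
    return torch.cat(recv, dim=-1)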

class bridge.peft.utils.ParallelLinearAdapter(
in_features: int,
out_features: int,
dim: int,
base_linear_name: str,
activation: str = 'swish',
column_init_method: str = 'xavier',
row_init_method: str = 'zero',
input_is_parallel: bool = False,
dropout: float = 0.0,
model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None,
alpha: Optional[float] = None,
dropout_position: str = 'post',
a2a_experimental: bool = False,
is_expert: bool = False,
disable_sequence_parallel_comm: bool = True,
base_linear_is_parallel: bool = True,
**kwargs,
)#

Bases: torch.nn.Module

Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.

This adapter implements a low-rank adaptation pattern using two linear layers with configurable parallelization strategies. It supports both tensor and sequence parallelism patterns used in large language model training.

The adapter follows the pattern: input -> linear_in -> activation -> linear_out -> scaling, where linear_in and linear_out are parallelized according to the base layer configuration.
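For reference, the same pattern on a single device (a hypothetical, non-parallel sketch, not the distributed implementation documented here) looks roughly as follows:

import torch
import torch.nn as nn
from typing import Optional

class SimpleAdapterSketch(nn.Module):
    # Single-device illustration of: input -> linear_in -> activation -> linear_out -> scaling.
    def __init__(self, in_features: int, out_features: int, dim: int, alpha: Optional[float] = None):
        super().__init__()
        self.linear_in = nn.Linear(in_features, dim, bias=False)
        self.linear_out = nn.Linear(dim, out_features, bias=False)
        self.activation = nn.SiLU()                      # stand-in for the 'swish' default
        nn.init.xavier_uniform_(self.linear_in.weight)   # mirrors column_init_method='xavier'
        nn.init.zeros_(self.linear_out.weight)           # mirrors row_init_method='zero'
        self.scale = (alpha if alpha is not None else dim) / dim  # assumed alpha/dim scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_out(self.activation(self.linear_in(x))) * self.scale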

Parameters:
  • in_features – Input feature dimension.

  • out_features – Output feature dimension.

  • dim – Adapter bottleneck dimension (rank).

  • base_linear_name – Name of the base linear layer being adapted.

  • activation – Activation function name (default: 'swish').

  • column_init_method – Initialization method for column parallel layer (default: 'xavier').

  • row_init_method – Initialization method for row parallel layer (default: 'zero').

  • input_is_parallel – Whether input is already parallelized (default: False).

  • dropout – Dropout probability (default: 0.0).

  • model_parallel_config – Configuration for model parallelism (default: None).

  • alpha – Scaling factor for adapter output (default: None, uses dim).

  • dropout_position – Where to apply dropout ('pre' or 'post', default: 'post').

  • a2a_experimental – Whether to use experimental all-to-all communication (default: False).

  • is_expert – Whether this adapter is for expert layers in MoE (default: False).

  • disable_sequence_parallel_comm – Whether to disable sequence parallel communication (default: True).

  • base_linear_is_parallel – Whether the base linear layer uses parallelization (default: True).

Initialization

Initialize the ParallelLinearAdapter.

Parameters:
  • in_features – Input feature dimension.

  • out_features – Output feature dimension.

  • dim – Adapter bottleneck dimension.

  • base_linear_name – Name of the base linear layer.

  • activation – Activation function name.

  • column_init_method – Initialization for column parallel layers.

  • row_init_method – Initialization for row parallel layers.

  • input_is_parallel – Whether input is already parallelized.

  • dropout – Dropout probability.

  • model_parallel_config – Model parallelism configuration.

  • alpha – Scaling factor (uses dim if None).

  • dropout_position – When to apply dropout.

  • a2a_experimental – Use experimental all-to-all communication.

  • is_expert – Whether for expert layers in MoE.

  • disable_sequence_parallel_comm – Disable sequence parallel communication.

  • dropout_recompute – Use recomputation for dropout.

  • **kwargs – Additional keyword arguments.

_get_activation_fn(activation: str) → torch.nn.Module#

Get activation function by name.

Parameters:

activation – Name of the activation function.

Returns:

PyTorch activation module.

Note: Defaults to Identity if the activation name is not recognized.

_get_init_fn(
init_method: str,
) → Callable[[torch.Tensor], torch.Tensor]#

Get initialization function by method name.

Parameters:

init_method – Name of the initialization method.

Returns:

Initialization function.

Raises:

NotImplementedError – If init_method is not supported.

forward(x: torch.Tensor) → torch.Tensor#

Forward pass of the parallel linear adapter.

Performs the adaptation computation with proper handling of parallel communication patterns, dropout, and expert routing for MoE scenarios.

Parameters:

x – Input tensor.

Returns:

Adapted output tensor with scaling applied.

sharded_state_dict(
prefix: str = '',
sharded_offsets: Tuple = (),
metadata: Optional[Dict] = None,
) → megatron.core.dist_checkpointing.mapping.ShardedStateDict#

Create sharded state dictionary for distributed checkpointing.

Special treatment is given to the linear_fc1 adapter because, with SwiGLU, the two logical matrices (gate and up) are sharded separately under tensor parallelism.

Parameters:
  • prefix – Prefix for parameter names.

  • sharded_offsets – Offsets for sharded parameters.

  • metadata – Additional metadata for sharding.

Returns:

Sharded state dictionary for distributed checkpointing.