bridge.peft.utils#
Module Contents#
Classes#
| `_All2AllHp2Sp` | All-2-All from Hidden Parallel to Sequence Parallel. |
| `ParallelLinearAdapter` | Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings. |
Functions#
| `get_adapter_attributes_from_linear` | Returns attributes from the base layer. |
| `is_expert_linear` | Return whether the current base module is an expert linear module. |
| `wildcard_match` | Return whether the pattern (target module to add LoRA) matches the key (model weight name). |
| `init_method_normal` | Create an initialization method based on normal distribution N(0, sigma). |
| `init_method_kaiming_uniform` | Create an initialization method based on Kaiming uniform distribution. |
| `init_method_const` | Create an initialization method that sets all values to a constant. |
| `pad_seq_to_mult` | Pad sequence length to be a multiple of mult. |
| `unpad_seq_to_mult` | Remove sequence padding that was added by pad_seq_to_mult. |
| `all2all_hp2sp` | Perform All-to-All communication from Hidden Parallel to Sequence Parallel. |
Data#
API#
- bridge.peft.utils.HAVE_TE#
all(…)
- bridge.peft.utils.TECL#
()
- bridge.peft.utils.TERL#
()
- bridge.peft.utils.get_adapter_attributes_from_linear(m: torch.nn.Module)#
Returns attributes from the base layer: input_is_parallel, in_features, out_features, disable_sequence_parallel_comm, and base_linear_is_parallel.
This function analyzes a linear module and extracts the key attributes needed for adapter configuration, particularly for PEFT adapters in distributed training scenarios.
- Parameters:
m – The linear module to analyze (should have a config attribute).
- Returns:
A tuple containing:
input_is_parallel: Whether the input is already parallelized
in_features: Input feature dimension
out_features: Output feature dimension
disable_sequence_parallel_comm: Whether to disable sequence parallel communication
base_linear_is_parallel: Whether the base linear layer uses parallelization
- Raises:
NotImplementedError – If the layer type is not recognized for LoRA adaptation.
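A minimal usage sketch follows. It assumes `module` is a recognized Megatron-Core tensor-parallel linear layer taken from an already-constructed model (for example a layer's `linear_qkv` projection); the variable `module` itself is a placeholder, not something defined by this API.

```python
from bridge.peft.utils import get_adapter_attributes_from_linear

# `module` is assumed to be a Megatron-Core parallel linear layer with a `config` attribute,
# e.g. model.decoder.layers[0].self_attention.linear_qkv
(input_is_parallel, in_features, out_features,
 disable_sequence_parallel_comm, base_linear_is_parallel) = get_adapter_attributes_from_linear(module)

# The extracted attributes are typically forwarded to ParallelLinearAdapter (documented below).
adapter_kwargs = dict(
    in_features=in_features,
    out_features=out_features,
    input_is_parallel=input_is_parallel,
    disable_sequence_parallel_comm=disable_sequence_parallel_comm,
    base_linear_is_parallel=base_linear_is_parallel,
)
```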
- bridge.peft.utils.is_expert_linear(fqn: str) → bool#
Return whether the current base module is an expert linear module.
This function checks if a fully qualified name (FQN) corresponds to an expert linear module in a Mixture of Experts (MoE) architecture.
- Parameters:
fqn – Fully qualified name of the module.
- Returns:
True if the module is an expert linear module, False otherwise.
Example:
is_expert_linear("model.layers.0.mlp.experts.0.linear_fc1")  # True
is_expert_linear("model.layers.0.mlp.linear_fc1")  # False
- bridge.peft.utils.wildcard_match(pattern: str, key: Optional[str])#
Return whether the pattern (target module to add LoRA) matches the key (model weight name).
This function performs wildcard matching using '*' as a placeholder for any substring.
- Parameters:
pattern – Pattern string with wildcards (*) to match against.
key – Key string to test against the pattern.
- Returns:
True if the pattern matches the key, False if it doesn't, None if key is None.
Example:
wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.0.self_attention.linear_qkv")  # True
wildcard_match("*.layers.0.*.linear_qkv", "decoder.layers.1.self_attention.linear_qkv")  # False
- bridge.peft.utils.init_method_normal(sigma: float)#
Create an initialization method based on normal distribution N(0, sigma).
- Parameters:
sigma – Standard deviation for the normal distribution.
- Returns:
Initialization function that applies normal distribution to a tensor.
- bridge.peft.utils.init_method_kaiming_uniform(val: float)#
Create an initialization method based on Kaiming uniform distribution.
- Parameters:
val – The 'a' parameter for Kaiming uniform initialization.
- Returns:
Initialization function that applies Kaiming uniform distribution to a tensor.
- bridge.peft.utils.init_method_const(val: float)#
Create an initialization method that sets all values to a constant.
- Parameters:
val – Constant value to initialize the tensor with.
- Returns:
Initialization function that sets tensor to constant value.
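A short sketch of how these factories are typically used; it assumes the returned initializers follow the usual Megatron convention of mutating the given tensor in place.

```python
import torch
from bridge.peft.utils import init_method_normal, init_method_kaiming_uniform, init_method_const

weight = torch.empty(16, 32)

# Each factory returns a callable that initializes the tensor it receives (assumed in-place).
init_method_normal(sigma=0.02)(weight)              # N(0, 0.02) initialization
init_method_kaiming_uniform(val=5 ** 0.5)(weight)   # Kaiming uniform with a = sqrt(5)
init_method_const(0.0)(weight)                      # zero init, e.g. for an adapter's output projection
```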
- bridge.peft.utils.pad_seq_to_mult(x: torch.Tensor, mult: int)#
Pad sequence length to be a multiple of mult.
This function pads the first dimension of the tensor to ensure it's divisible by mult. Used primarily for MoE (Mixture of Experts) operations that require specific sequence lengths.
- Parameters:
x – Input tensor to pad.
mult – Multiple that the sequence length should be divisible by.
- Returns:
A tuple containing:
Padded tensor
Number of padding elements added
- bridge.peft.utils.unpad_seq_to_mult(x: torch.Tensor, pad_len: int) → torch.Tensor#
Remove sequence padding that was added by pad_seq_to_mult.
- Parameters:
x – Padded tensor to unpad.
pad_len – Number of padding elements to remove from the end.
- Returns:
Unpadded tensor with pad_len elements removed from the first dimension.
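A round-trip sketch based on the two signatures above; it assumes `bridge.peft.utils` is importable as documented.

```python
import torch
from bridge.peft.utils import pad_seq_to_mult, unpad_seq_to_mult

x = torch.randn(10, 2, 64)                     # sequence length 10 on the first dimension
padded, pad_len = pad_seq_to_mult(x, mult=4)   # padded to length 12, so pad_len == 2
assert padded.shape[0] % 4 == 0

restored = unpad_seq_to_mult(padded, pad_len)  # drops pad_len elements from the end of dim 0
assert restored.shape == x.shape
```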
- class bridge.peft.utils._All2AllHp2Sp#
Bases:
torch.autograd.Function
All-2-All from Hidden Parallel to Sequence Parallel.
This is a temporary workaround for distributed communication patterns and can be updated in the future. It performs all-to-all communication to transform from hidden parallel to sequence parallel layout.
TODO: Move the functionality to MCore
- static forward(ctx, input_: torch.Tensor) → torch.Tensor#
Forward pass: All-to-All from Hidden Parallel to Sequence Parallel.
- Parameters:
ctx – Autograd context (unused but required by the Function interface).
input_ – Input tensor in hidden parallel layout.
- Returns:
Output tensor in sequence parallel layout.
- static backward(ctx, grad_output: torch.Tensor) → torch.Tensor#
Backward pass: All-to-All from Sequence Parallel to Hidden Parallel.
- Parameters:
ctx – Autograd context (unused but required by the Function interface).
grad_output – Gradient tensor in sequence parallel layout.
- Returns:
Gradient tensor in hidden parallel layout.
- bridge.peft.utils.all2all_hp2sp(input_: torch.Tensor) → torch.Tensor#
Perform All-to-All communication from Hidden Parallel to Sequence Parallel.
- Parameters:
input_ – Input tensor in hidden parallel layout.
- Returns:
Output tensor in sequence parallel layout.
- class bridge.peft.utils.ParallelLinearAdapter(
- in_features: int,
- out_features: int,
- dim: int,
- base_linear_name: str,
- activation: str = 'swish',
- column_init_method: str = 'xavier',
- row_init_method: str = 'zero',
- input_is_parallel: bool = False,
- dropout: float = 0.0,
- model_parallel_config: Optional[megatron.core.ModelParallelConfig] = None,
- alpha: Optional[float] = None,
- dropout_position: str = 'post',
- a2a_experimental: bool = False,
- is_expert: bool = False,
- disable_sequence_parallel_comm: bool = True,
- base_linear_is_parallel: bool = True,
- **kwargs,
- )#
Bases:
torch.nn.Module
Parallel Linear Adapter for Parameter-Efficient Fine-Tuning (PEFT) in distributed settings.
This adapter implements a low-rank adaptation pattern using two linear layers with configurable parallelization strategies. It supports both tensor and sequence parallelism patterns used in large language model training.
The adapter follows the pattern: input -> linear_in -> activation -> linear_out -> scaling, where linear_in and linear_out are parallelized according to the base layer configuration.
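To make the dataflow concrete, here is a minimal single-GPU sketch of the same pattern. It is not the ParallelLinearAdapter class itself: all tensor/sequence parallelism is omitted, the `alpha / dim` output scaling follows the common LoRA convention rather than anything stated above, and the dropout placement for the default `dropout_position='post'` is likewise an assumption.

```python
from typing import Optional

import torch
import torch.nn as nn


class ToyLinearAdapter(nn.Module):
    """Non-parallel illustration of: input -> linear_in -> activation -> linear_out -> scaling."""

    def __init__(self, in_features: int, out_features: int, dim: int,
                 alpha: Optional[float] = None, dropout: float = 0.0):
        super().__init__()
        self.linear_in = nn.Linear(in_features, dim, bias=False)    # column-parallel in the real adapter
        self.linear_out = nn.Linear(dim, out_features, bias=False)  # row-parallel in the real adapter
        nn.init.xavier_uniform_(self.linear_in.weight)              # mirrors column_init_method='xavier'
        nn.init.zeros_(self.linear_out.weight)                      # mirrors row_init_method='zero'
        self.activation = nn.SiLU()                                 # 'swish'
        self.dropout = nn.Dropout(dropout)                          # dropout_position='post' assumed here
        self.scale = (alpha if alpha is not None else dim) / dim    # alpha / dim scaling is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.linear_out(self.activation(self.linear_in(x)))
        return self.dropout(out) * self.scale


adapter = ToyLinearAdapter(in_features=128, out_features=128, dim=16, alpha=32)
print(adapter(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```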
- Parameters:
in_features – Input feature dimension.
out_features – Output feature dimension.
dim – Adapter bottleneck dimension (rank).
base_linear_name – Name of the base linear layer being adapted.
activation – Activation function name (default: 'swish').
column_init_method – Initialization method for column parallel layer (default: 'xavier').
row_init_method – Initialization method for row parallel layer (default: 'zero').
input_is_parallel – Whether input is already parallelized (default: False).
dropout – Dropout probability (default: 0.0).
model_parallel_config – Configuration for model parallelism (default: None).
alpha – Scaling factor for adapter output (default: None, uses dim).
dropout_position – Where to apply dropout ('pre' or 'post', default: 'post').
a2a_experimental – Whether to use experimental all-to-all communication (default: False).
is_expert – Whether this adapter is for expert layers in MoE (default: False).
disable_sequence_parallel_comm – Whether to disable sequence parallel communication (default: True).
base_linear_is_parallel – Whether the base linear layer uses parallelization (default: True).
Initialization
Initialize the ParallelLinearAdapter.
- Parameters:
in_features – Input feature dimension.
out_features – Output feature dimension.
dim – Adapter bottleneck dimension.
base_linear_name – Name of the base linear layer.
activation – Activation function name.
column_init_method – Initialization for column parallel layers.
row_init_method – Initialization for row parallel layers.
input_is_parallel – Whether input is already parallelized.
dropout – Dropout probability.
model_parallel_config – Model parallelism configuration.
alpha – Scaling factor (uses dim if None).
dropout_position – When to apply dropout.
a2a_experimental – Use experimental all-to-all communication.
is_expert – Whether for expert layers in MoE.
disable_sequence_parallel_comm – Disable sequence parallel communication.
dropout_recompute – Use recomputation for dropout.
**kwargs – Additional keyword arguments.
- _get_activation_fn(activation: str) → torch.nn.Module#
Get activation function by name.
- Parameters:
activation – Name of the activation function.
- Returns:
PyTorch activation module.
Note: Defaults to Identity if the activation name is not recognized.
- _get_init_fn(init_method: str)#
Get initialization function by method name.
- Parameters:
init_method – Name of the initialization method.
- Returns:
Initialization function.
- Raises:
NotImplementedError – If init_method is not supported.
- forward(x: torch.Tensor) → torch.Tensor#
Forward pass of the parallel linear adapter.
Performs the adaptation computation with proper handling of parallel communication patterns, dropout, and expert routing for MoE scenarios.
- Parameters:
x – Input tensor.
- Returns:
Adapted output tensor with scaling applied.
- sharded_state_dict(prefix: str = '', sharded_offsets: Tuple = (), metadata: Optional[Dict] = None)#
Create sharded state dictionary for distributed checkpointing.
Special treatment is given to the linear_fc1 adapter since tensor parallelism is sharded separately for the two logical matrices (gate and up) in SwiGLU.
- Parameters:
prefix – Prefix for parameter names.
sharded_offsets – Offsets for sharded parameters.
metadata – Additional metadata for sharding.
- Returns:
Sharded state dictionary for distributed checkpointing.