Adapter Components

An adapter can be considered to be any set of parameters that is added to a pre-existing module/model. NeMo currently supports the standard adapter found in the literature; more advanced adapter modules are active areas of research and can potentially be supported by NeMo in the future.

An adapter module can be any PyTorch module, but it must satisfy a few straightforward requirements:

  1. The module accepts an input of some dimension, and its output must match this dimension.

  2. Ideally, the module is initialized such that the output of the adapter, at initialization, does not modify the original input. This allows the model to produce the same results as before, even after additional parameters have been added. A minimal sketch satisfying both requirements is shown below.
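For concreteness, the following is a minimal sketch of a toy adapter that satisfies both requirements. The class name and layer sizes are purely illustrative and not part of NeMo; only the matching dimensions and the zero-initialization matter here.

import torch
from torch import nn

class ToyAdapter(nn.Module):
    # Hypothetical example for illustration only (not a NeMo class).
    def __init__(self, in_features: int, hidden_dim: int):
        super().__init__()
        self.down = nn.Linear(in_features, hidden_dim)
        self.activation = nn.ReLU()
        # Requirement 1: the output dimension matches the input dimension.
        self.up = nn.Linear(hidden_dim, in_features)
        # Requirement 2: zero-initialize the final layer so the adapter initially
        # outputs zeros and leaves the model's behaviour unchanged.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.activation(self.down(x)))

adapter = ToyAdapter(in_features=512, hidden_dim=32)
x = torch.randn(4, 512)
assert torch.allclose(adapter(x), torch.zeros_like(x))  # inert until trained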

According to He et al. [1], an adapter can be represented by three components:

  1. Functional form - the trainable parameters that will modify the input

  2. Insertion form - where the adapter's outputs are integrated with the original inputs. The input to the adapter can be the last output of the layer, the input to some attention layer, or even the original input to the module itself (before the module's forward pass).

  3. Composition function - how the adapter's outputs are integrated with its inputs. This can be as simple as a residual (additive) connection, concatenation, or point-wise multiplication. A short illustration follows this list.
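These three components can be illustrated in a few lines of plain PyTorch; the layer, sizes, and choice of residual addition below are arbitrary examples, not NeMo code.

import torch
from torch import nn

layer = nn.Linear(512, 512)                                                  # the pre-existing module being adapted
adapter = nn.Sequential(nn.Linear(512, 32), nn.ReLU(), nn.Linear(32, 512))  # 1. functional form

x = torch.randn(4, 512)
y = layer(x)        # original forward pass of the module
a = adapter(y)      # 2. insertion form: this adapter consumes the layer output `y`
                    #    (a parallel adapter would consume the original input `x` instead)
out = y + a         # 3. composition function: residual addition of the adapter output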

Adapter modules represent the functional form of the adapter. Below, we discuss one of the most commonly used adapter modules in the literature, the LinearAdapter (or Houlsby Adapter) [2].

Note

All adapter modules must extend AdapterModuleUtil and should ideally have an equivalent DataClass config for easy instantiation!

class nemo.collections.common.parts.adapter_modules.AdapterModuleUtil

Bases: nemo.core.classes.mixins.access_mixins.AccessMixin

Base class of Adapter Modules, providing common functionality to all Adapter Modules.

setup_adapter_strategy(adapter_strategy: Optional[nemo.core.classes.mixins.adapter_mixin_strategies.AbstractAdapterStrategy])

Setup adapter strategy of this class, enabling dynamic change in the way the adapter output is merged with the input.

When called successfully, will assign the variable adapter_strategy to the module.

Parameters

adapter_strategy – Can be None or an implementation of AbstractAdapterStrategy.

get_default_strategy_config() → dataclasses.dataclass

Returns a default adapter module strategy.

adapter_unfreeze()

Sets requires_grad to True for all parameters of the adapter. This method should be overridden for any custom unfreeze behavior that is required, for example if not all parameters of the adapter should be unfrozen.
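As a hedged sketch, a custom adapter module might combine torch.nn.Module with AdapterModuleUtil as follows. The class name and layer layout are assumptions made for this example; only the AdapterModuleUtil calls correspond to the methods documented above.

import torch
from torch import nn
from nemo.collections.common.parts import adapter_modules

class MyCustomAdapter(nn.Module, adapter_modules.AdapterModuleUtil):
    # Hypothetical adapter module; mirrors the base classes used by LinearAdapter below.
    def __init__(self, in_features: int, adapter_strategy=None):
        super().__init__()
        self.linear = nn.Linear(in_features, in_features)
        nn.init.zeros_(self.linear.weight)  # keep the adapter inert at initialization
        nn.init.zeros_(self.linear.bias)
        # Register the composition function; per the docs above, None or an
        # AbstractAdapterStrategy implementation may be passed.
        self.setup_adapter_strategy(adapter_strategy)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

adapter = MyCustomAdapter(in_features=512)
adapter.adapter_unfreeze()                           # make all adapter parameters trainable
default_cfg = adapter.get_default_strategy_config()  # inspect the default composition config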


class nemo.collections.common.parts.adapter_modules.LinearAdapter(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.collections.common.parts.adapter_modules.AdapterModuleUtil

Simple Linear Feedforward Adapter module with LayerNorm and a single hidden layer with an activation function. Note: The adapter explicitly initializes its final layer with all zeros in order to avoid affecting the original model when all adapters are disabled.

Parameters
  • in_features – Input dimension of the module. Note that for adapters, input_dim == output_dim.

  • dim – Hidden dimension of the feed forward network.

  • activation – Str name for an activation function.

  • norm_position – Str, can be pre or post. Defaults to pre. Determines whether the normalization will occur in the first layer or the last layer. Certain architectures may prefer one over the other.

  • dropout – Float value; the probability of dropout applied to the output of the last layer of the adapter.

  • adapter_strategy – An adapter composition function object. Defaults to ResidualAddAdapterStrategyConfig.
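As an illustrative usage example, a LinearAdapter can be constructed directly with these parameters; the sizes and parameter values below are arbitrary examples.

from nemo.collections.common.parts import adapter_modules

adapter = adapter_modules.LinearAdapter(
    in_features=512,      # must equal the dimension of the module the adapter augments
    dim=32,               # hidden (bottleneck) dimension of the feed-forward network
    activation='swish',   # string name of an activation function
    norm_position='pre',  # apply the normalization before the hidden layer
    dropout=0.0,          # no dropout on the adapter output
)

In practice, adapters are usually added to an adapter-compatible model via its add_adapter() method (see the Module Adapters section below), passing the adapter's dataclass config rather than the instantiated module.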

Insertion Form - Module Adapters

Adapter modules can be integrated into many different locations of a given module. For example, it is possible to have an adapter that affects only the outputs of the final layer in each module. We can also have a Parallel Adapter [1] that operates at the input of the module itself, in parallel to the forward pass of the module. Yet another insertion location is inside the Multi Head Attention Layers.

On top of this, while adapters are commonly used only in the layers containing the most parameters (say the Encoder of a network), some models can support adapters in multiple locations (Encoder-Decoder architecture for Language Models, Machine Translation, or even Encoder-Decoder-Joint for ASR with Transducer Loss). As such, NeMo utilizes the concept of Module Adapters.

Module Adapters are defined very simply when adding an adapter: by specifying the name of the module into which the adapter should be inserted.


# Get the list of supported modules / locations in an adapter-compatible model
print(model.adapter_module_names)  # assume ['', 'encoder', 'decoder']

# When calling add_adapter, specify the module name to the left of the colon symbol,
# and the adapter name after it. The adapter is then directed to the decoder module
# instead of the default / encoder module.
model.add_adapter("decoder:first_adapter", cfg=...)

You might note that model.adapter_module_names can sometimes return '' as one of the supported module names - this refers to the “default module”. Generally, we try to make the default the most commonly used adapter location in the literature - for example, Encoder adapters in NLP/NMT/ASR.

Composition Function - Adapter Strategies

Finally, we discuss how the input and output of adapter modules are composed. In order to generalize this step, we construct Adapter Strategies. A strategy is any class (not a torch.nn.Module!) that extends AbstractAdapterStrategy and provides a forward() method that accepts a specific signature of inputs and produces an output tensor combining the adapter's input and output in some way.

Below, we discuss a simple residual addition strategy that accepts an input to the adapter and the adapter's output, and simply adds them together. It also supports stochastic_depth, which enables adapters to be dynamically switched off during training, making training more robust.

class nemo.core.classes.mixins.adapter_mixin_strategies.AbstractAdapterStrategy

Bases: abc.ABC

forward(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin)

Forward method that defines how the output of the adapter should be merged with the input, or if it should be merged at all.

The strategy is also provided with the module that called it, thereby allowing access to all other adapters in the calling module. This can be useful if one adapter is a meta adapter that combines the outputs of various adapters. In such a case, the input can be forwarded across all other adapters, collecting their outputs, and those outputs can then be merged via some strategy.

Parameters
  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.
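To make this contract concrete, the following is a hypothetical custom strategy. The class name and its scaling behaviour are assumptions made for this example and do not correspond to an existing NeMo strategy; only the base class and the forward() signature follow the documentation above.

import torch
from nemo.core.classes.mixins import adapter_mixin_strategies

class ScaledResidualAddStrategy(adapter_mixin_strategies.AbstractAdapterStrategy):
    # Hypothetical strategy: residual addition with a constant scaling factor.
    def __init__(self, scale: float = 1.0):
        super().__init__()
        self.scale = scale

    def forward(self, input: torch.Tensor, adapter: torch.nn.Module, *, module):
        adapter_out = adapter(input)             # run the adapter's functional form
        return input + self.scale * adapter_out  # composition: scaled residual addition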


class nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategy(stochastic_depth: float = 0.0, l2_lambda: float = 0.0)

Bases: nemo.core.classes.mixins.adapter_mixin_strategies.AbstractAdapterStrategy

An implementation of residual addition of an adapter module with its input. Supports stochastic depth regularization.
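As a usage sketch, the strategy can be instantiated and handed to an adapter module via its adapter_strategy argument, for example to enable stochastic depth. The numeric values are arbitrary, and the role of l2_lambda is an assumption based on compute_auxiliary_losses below.

from nemo.core.classes.mixins import adapter_mixin_strategies
from nemo.collections.common.parts import adapter_modules

# Skip the adapter with 10% probability during training; l2_lambda is assumed to
# control the auxiliary L2 loss registered by compute_auxiliary_losses.
strategy = adapter_mixin_strategies.ResidualAddAdapterStrategy(stochastic_depth=0.1, l2_lambda=1e-3)

adapter = adapter_modules.LinearAdapter(in_features=512, dim=32, adapter_strategy=strategy)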

forward(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin)

A basic strategy comprising a residual connection over the input after the forward pass of the underlying adapter.

Parameters
  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.

compute_output(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin) → torch.Tensor

Compute the output of a single adapter for some input.

Parameters
  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.

apply_stochastic_depth(output: torch.Tensor, input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin)

Compute and apply stochastic depth if probability is greater than 0.

Parameters
  • output – The result tensor, after one of the active adapters has finished its forward pass.

  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after stochastic depth has been potentially applied to it.

compute_auxiliary_losses(output: torch.Tensor, input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin)

Compute any auxiliary losses and preserve them in the tensor registry.

Parameters
  • output – The result tensor, after one of the active adapters has finished its forward pass.

  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.
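Putting the pieces together, the forward() of this strategy can be thought of as composing the helpers above. The following is a simplified sketch of that flow for intuition, not the exact NeMo implementation.

def residual_add_forward(strategy, input, adapter, *, module):
    # 1. Run the adapter itself on the incoming tensor.
    out = strategy.compute_output(input, adapter, module=module)
    # 2. Possibly drop the adapter's contribution (stochastic depth, training only).
    out = strategy.apply_stochastic_depth(out, input, adapter, module=module)
    # 3. Compose via residual addition.
    result = input + out
    # 4. Register any auxiliary losses (e.g. the l2_lambda penalty) in the tensor registry.
    strategy.compute_auxiliary_losses(result, input, adapter, module=module)
    return result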


References

[1]

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. 2021. URL: https://arxiv.org/abs/2110.04366, doi:10.48550/ARXIV.2110.04366.

[2]

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2790–2799. PMLR, 2019.
