A Comparison of Performant and Canonical LoRA Variants
There are two variants of LoRA implemented in NeMo 2.0: “performant LoRA” (LoRA) and “canonical LoRA” (CanonicalLoRA).
The distinction arises because Megatron Core optimizes the following two linear modules by fusing multiple linear layers into a single layer. When these fused layers are adapted with LoRA, the performant version likewise uses only one adapter per fused module. The two linear modules are:
- linear_qkv: The projection matrix in self-attention that transforms the hidden state into query, key, and value. Megatron Core fuses these three projection matrices into a single matrix to efficiently parallelize the matrix multiplication. Hence, performant LoRA applies a single adapter to the fused QKV projection matrix, whereas canonical LoRA applies three adapters.
- linear_fc1: The first linear layer in the MLP module, before the intermediate activation. For gated linear activations, Megatron Core fuses the up and gate projection matrices into a single matrix for efficient parallelization. Hence, performant LoRA applies a single adapter to the fused up and gate projection matrices, whereas canonical LoRA applies two adapters. (A configuration sketch for both variants follows this list.)
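For configuration purposes, the two variants are exposed as separate PEFT classes. The snippet below is a minimal sketch of how they might be set up in NeMo 2.0; the argument names (dim, alpha, target_modules) and the per-projection module names used for canonical LoRA are assumptions about the nemo.collections.llm.peft API, so verify them against the API reference for your NeMo version.

```python
# Minimal sketch (not verified against a specific NeMo release): the argument names
# and the canonical target-module names below are assumptions -- check the
# nemo.collections.llm.peft API reference.
from nemo.collections import llm

# Performant LoRA: one adapter per fused Megatron Core linear layer.
performant_lora = llm.peft.LoRA(
    target_modules=["linear_qkv", "linear_fc1"],  # fused QKV and fused up/gate projections
    dim=16,    # LoRA rank r
    alpha=32,  # LoRA scaling factor
)

# Canonical LoRA: one adapter per logical projection, matching the Hugging Face layout.
canonical_lora = llm.peft.CanonicalLoRA(
    target_modules=[
        "linear_q", "linear_k", "linear_v",    # three adapters instead of one linear_qkv adapter
        "linear_fc1_up", "linear_fc1_gate",    # two adapters instead of one linear_fc1 adapter
    ],
    dim=16,
    alpha=32,
)
```

Either object is then typically passed to the fine-tuning entry point, for example as the peft argument of llm.finetune.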
The following two figures illustrate the difference between canonical and performant LoRA, using the linear_qkv layer as an example. Canonical LoRA runs three adapters sequentially, while performant LoRA runs one adapter.
[Figure: Canonical LoRA applied to linear_qkv, using three adapters]
[Figure: Performant LoRA applied to linear_qkv, using a single adapter]
Canonical LoRA conforms to the Hugging Face implementation [1], though it is slower because it performs several matrix multiplications sequentially, as described above. Performant LoRA has fewer parameters than canonical LoRA and can often achieve the same level of accuracy.
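To make the parameter difference concrete, the following back-of-the-envelope calculation compares adapter sizes for the linear_qkv layer. The dimensions are illustrative example values, not taken from any particular model configuration.

```python
# Illustrative adapter parameter counts for linear_qkv (example dimensions).
h, d = 4096, 128       # hidden_size, kv_channels (head dimension)
n_q, n_kv = 32, 32     # num_attention_heads, num_query_groups (no GQA in this example)
r = 16                 # LoRA rank

b_params = (n_q * d + 2 * n_kv * d) * r   # total size of the B matrices (same in both variants)

canonical = 3 * r * h + b_params          # three A matrices: A_q, A_k, A_v
performant = r * h + b_params             # a single A matrix: A_qkv

print(canonical, performant)              # 393216 vs 262144: performant saves 2*r*h parameters
```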
Though not immediately apparent, performant LoRA is mathematically equivalent to canonical LoRA when the \(A_q, A_k, A_v\) matrices are tied (i.e., forced to share the same weights during training) in linear_qkv, and similarly when the \(A_{up}, A_{gate}\) matrices are tied in linear_fc1. The proof for linear_qkv can be found below.
Proof that performant LoRA is equivalent to canonical LoRA with weights tied
Let \([x \quad y]\) denote matrix concatenation. (In NeMo, this concatenation is done in an interleaved fashion, but this does not affect the proof below.)
Let \(A_q = A_k = A_v = A_{qkv}\) (weight tying)
Then the canonical LoRA update to the linear_qkv output is

\[
\begin{aligned}
[\,B_q A_q x \quad B_k A_k x \quad B_v A_v x\,]
&= [\,B_q A_{qkv} x \quad B_k A_{qkv} x \quad B_v A_{qkv} x\,] \\
&= [\,B_q \quad B_k \quad B_v\,]\, A_{qkv}\, x \\
&= B_{qkv}\, A_{qkv}\, x,
\end{aligned}
\]

where \(B_{qkv} = [\,B_q \quad B_k \quad B_v\,]\) is exactly the single fused adapter used by performant LoRA.
Note: the dimensions of the weight matrices are as follows:

\[
A_q,\, A_k,\, A_v,\, A_{qkv} \in \mathbb{R}^{r \times h}, \qquad
B_q \in \mathbb{R}^{(n_q \cdot d) \times r}, \qquad
B_k,\, B_v \in \mathbb{R}^{(n_{kv} \cdot d) \times r}, \qquad
B_{qkv} \in \mathbb{R}^{((n_q + 2 n_{kv}) \cdot d) \times r},
\]

where

- \(n_q\): Number of attention heads (num_attention_heads).
- \(n_{kv}\): Number of key-value heads (num_query_groups). Note that if grouped query attention (GQA) is not used, \(n_{kv} = n_q\).
- \(h\): Transformer hidden size (hidden_size).
- \(d\): Transformer head dimension (kv_channels).
- \(r\): LoRA rank.
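The equivalence can also be checked numerically. The sketch below uses NumPy with plain (non-interleaved) concatenation and arbitrary small dimensions; it mirrors the algebra above rather than NeMo's actual interleaved implementation.

```python
# Numerical sanity check: with A_q = A_k = A_v = A_qkv, canonical LoRA (three
# adapters, outputs concatenated) matches performant LoRA (one fused adapter).
import numpy as np

rng = np.random.default_rng(0)
h, d, n_q, n_kv, r = 64, 8, 4, 2, 4   # arbitrary small dimensions (with GQA: n_kv < n_q)

A_qkv = rng.standard_normal((r, h))   # shared (tied) A matrix
B_q = rng.standard_normal((n_q * d, r))
B_k = rng.standard_normal((n_kv * d, r))
B_v = rng.standard_normal((n_kv * d, r))
x = rng.standard_normal(h)

# Canonical LoRA: three adapters applied separately, outputs concatenated.
canonical = np.concatenate([B_q @ (A_qkv @ x), B_k @ (A_qkv @ x), B_v @ (A_qkv @ x)])

# Performant LoRA: a single adapter with B_qkv = [B_q; B_k; B_v] stacked on the output axis.
B_qkv = np.concatenate([B_q, B_k, B_v], axis=0)
performant = B_qkv @ (A_qkv @ x)

assert np.allclose(canonical, performant)
```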
The implications of this observation are as follows:
- Hugging Face LoRA can only be imported to NeMo canonical LoRA.
- NeMo canonical LoRA can be exported to Hugging Face LoRA.
- NeMo performant LoRA can also be exported to Hugging Face LoRA by replicating the weights of \(A_{qkv}\) three times (see the sketch below).
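As an illustration of the last point, the sketch below converts a performant linear_qkv adapter into three per-projection (Hugging Face-style) adapters by replicating \(A_{qkv}\) and splitting \(B_{qkv}\). It assumes a plain [query | key | value] row ordering and ignores NeMo's interleaved layout, and the q_proj/k_proj/v_proj names are placeholders, so treat it as a conceptual sketch rather than a drop-in converter.

```python
import numpy as np

def split_performant_qkv(A_qkv: np.ndarray, B_qkv: np.ndarray, n_q: int, n_kv: int, d: int):
    """Conceptually convert a performant linear_qkv adapter (A_qkv: r x h,
    B_qkv: (n_q + 2*n_kv)*d x r) into three canonical-style adapters.
    Assumes B_qkv rows are ordered [query | key | value]; NeMo's real layout
    is interleaved, so a real exporter must also de-interleave the rows."""
    B_q = B_qkv[: n_q * d]
    B_k = B_qkv[n_q * d : (n_q + n_kv) * d]
    B_v = B_qkv[(n_q + n_kv) * d :]
    # A_qkv is simply replicated: one copy per projection.
    return {
        "q_proj": (A_qkv.copy(), B_q),
        "k_proj": (A_qkv.copy(), B_k),
        "v_proj": (A_qkv.copy(), B_v),
    }
```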