A Comparison of Performant and Canonical LoRA Variants
There are two variants of LoRA implemented in NeMo 2.0: “performant LoRA” (LoRA) and “canonical LoRA” (CanonicalLoRA).
The distinction arises because Megatron Core optimizes the following two linear modules by fusing multiple linear layers into a single layer. When these fused layers are adapted with LoRA, the performant version likewise uses only one adapter per fused module. The two linear modules are:
- linear_qkv: The projection matrix in self-attention that transforms the hidden state into query, key, and value. Megatron Core fuses these three projection matrices into a single matrix to efficiently parallelize the matrix multiplication. Hence, performant LoRA applies a single adapter to the fused QKV projection matrix, whereas canonical LoRA applies three adapters.
- linear_fc1: The first linear layer in the MLP module, before the intermediate activation. For gated linear activations, Megatron Core fuses the up and gate projection matrices into a single matrix for efficient parallelization. Hence, performant LoRA applies a single adapter to the fused up and gate projection matrices, whereas canonical LoRA applies two adapters. (A configuration sketch for both variants follows this list.)
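For configuration purposes, the two variants are exposed as separate PEFT classes. The snippet below is a minimal sketch of how they might be set up in NeMo 2.0; the argument names (dim, alpha, target_modules) and the per-projection module names used for canonical LoRA are assumptions about the nemo.collections.llm.peft API, so verify them against the API reference for your NeMo version.

```python
# Minimal sketch (not verified against a specific NeMo release): the argument names
# and the canonical target-module names below are assumptions -- check the
# nemo.collections.llm.peft API reference.
from nemo.collections import llm

# Performant LoRA: one adapter per fused Megatron Core linear layer.
performant_lora = llm.peft.LoRA(
    target_modules=["linear_qkv", "linear_fc1"],  # fused QKV and fused up/gate projections
    dim=16,    # LoRA rank r
    alpha=32,  # LoRA scaling factor
)

# Canonical LoRA: one adapter per logical projection, matching the Hugging Face layout.
canonical_lora = llm.peft.CanonicalLoRA(
    target_modules=[
        "linear_q", "linear_k", "linear_v",    # three adapters instead of one linear_qkv adapter
        "linear_fc1_up", "linear_fc1_gate",    # two adapters instead of one linear_fc1 adapter
    ],
    dim=16,
    alpha=32,
)
```

Either object is then typically passed to the fine-tuning entry point, for example as the peft argument of llm.finetune.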
The following two figures illustrate the difference between canonical and performant LoRA, using the linear_qkv layer as an example. Canonical LoRA runs three adapters sequentially, while performant LoRA runs one adapter.
[Figure: Canonical LoRA applied to linear_qkv, using three adapters]
[Figure: Performant LoRA applied to linear_qkv, using a single adapter]
Canonical LoRA conforms to the Hugging Face implementation [1], though it is slower because it performs several matrix multiplications sequentially, as described above. Performant LoRA has fewer parameters than canonical LoRA and can often achieve the same level of accuracy.
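To make the parameter difference concrete, the following back-of-the-envelope calculation compares adapter sizes for the linear_qkv layer. The dimensions are illustrative example values, not taken from any particular model configuration.

```python
# Illustrative adapter parameter counts for linear_qkv (example dimensions).
h, d = 4096, 128       # hidden_size, kv_channels (head dimension)
n_q, n_kv = 32, 32     # num_attention_heads, num_query_groups (no GQA in this example)
r = 16                 # LoRA rank

b_params = (n_q * d + 2 * n_kv * d) * r   # total size of the B matrices (same in both variants)

canonical = 3 * r * h + b_params          # three A matrices: A_q, A_k, A_v
performant = r * h + b_params             # a single A matrix: A_qkv

print(canonical, performant)              # 393216 vs 262144: performant saves 2*r*h parameters
```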
Though not immediately apparent, performant LoRA is mathematically equivalent to canonical LoRA when the \(A_q, A_k, A_v\) matrices are tied (i.e., forced to share the same weights during training) in linear_qkv, and similarly when the \(A_{up}, A_{gate}\) matrices are tied in linear_fc1. The proof for linear_qkv can be found below.
Proof that performant LoRA is equivalent to canonical LoRA with weights tied
Let \([x \quad y]\) denote matrix concatenation. (In NeMo, this concatenation is done in an interleaved fashion, but this does not affect the proof below.)
Let \(A_q = A_k = A_v = A_{qkv}\) (weight tying)
Then the canonical LoRA update to the linear_qkv output is

\[
\begin{aligned}
[\,B_q A_q x \quad B_k A_k x \quad B_v A_v x\,]
&= [\,B_q A_{qkv} x \quad B_k A_{qkv} x \quad B_v A_{qkv} x\,] \\
&= [\,B_q \quad B_k \quad B_v\,]\, A_{qkv}\, x \\
&= B_{qkv}\, A_{qkv}\, x,
\end{aligned}
\]

where \(B_{qkv} = [\,B_q \quad B_k \quad B_v\,]\) is exactly the single fused adapter used by performant LoRA.
Note: the dimensions of the weight matrices are as follows:

\[
A_q,\, A_k,\, A_v,\, A_{qkv} \in \mathbb{R}^{r \times h}, \qquad
B_q \in \mathbb{R}^{(n_q \cdot d) \times r}, \qquad
B_k,\, B_v \in \mathbb{R}^{(n_{kv} \cdot d) \times r}, \qquad
B_{qkv} \in \mathbb{R}^{((n_q + 2 n_{kv}) \cdot d) \times r},
\]

where

- \(n_q\): Number of attention heads (num_attention_heads).
- \(n_{kv}\): Number of key-value heads (num_query_groups). Note that if grouped query attention (GQA) is not used, \(n_{kv} = n_q\).
- \(h\): Transformer hidden size (hidden_size).
- \(d\): Transformer head dimension (kv_channels).
- \(r\): LoRA rank.
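The equivalence can also be checked numerically. The sketch below uses NumPy with plain (non-interleaved) concatenation and arbitrary small dimensions; it mirrors the algebra above rather than NeMo's actual interleaved implementation.

```python
# Numerical sanity check: with A_q = A_k = A_v = A_qkv, canonical LoRA (three
# adapters, outputs concatenated) matches performant LoRA (one fused adapter).
import numpy as np

rng = np.random.default_rng(0)
h, d, n_q, n_kv, r = 64, 8, 4, 2, 4   # arbitrary small dimensions (with GQA: n_kv < n_q)

A_qkv = rng.standard_normal((r, h))   # shared (tied) A matrix
B_q = rng.standard_normal((n_q * d, r))
B_k = rng.standard_normal((n_kv * d, r))
B_v = rng.standard_normal((n_kv * d, r))
x = rng.standard_normal(h)

# Canonical LoRA: three adapters applied separately, outputs concatenated.
canonical = np.concatenate([B_q @ (A_qkv @ x), B_k @ (A_qkv @ x), B_v @ (A_qkv @ x)])

# Performant LoRA: a single adapter with B_qkv = [B_q; B_k; B_v] stacked on the output axis.
B_qkv = np.concatenate([B_q, B_k, B_v], axis=0)
performant = B_qkv @ (A_qkv @ x)

assert np.allclose(canonical, performant)
```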
The implications of this observation are as follows:
- Hugging Face LoRA can only be imported to NeMo canonical LoRA.
- NeMo canonical LoRA can be exported to Hugging Face LoRA.
- NeMo performant LoRA can also be exported to Hugging Face LoRA by replicating the weights of \(A_{qkv}\) three times (see the sketch below).
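As an illustration of the last point, the sketch below converts a performant linear_qkv adapter into three per-projection (Hugging Face-style) adapters by replicating \(A_{qkv}\) and splitting \(B_{qkv}\). It assumes a plain [query | key | value] row ordering and ignores NeMo's interleaved layout, and the q_proj/k_proj/v_proj names are placeholders, so treat it as a conceptual sketch rather than a drop-in converter.

```python
import numpy as np

def split_performant_qkv(A_qkv: np.ndarray, B_qkv: np.ndarray, n_q: int, n_kv: int, d: int):
    """Conceptually convert a performant linear_qkv adapter (A_qkv: r x h,
    B_qkv: (n_q + 2*n_kv)*d x r) into three canonical-style adapters.
    Assumes B_qkv rows are ordered [query | key | value]; NeMo's real layout
    is interleaved, so a real exporter must also de-interleave the rows."""
    B_q = B_qkv[: n_q * d]
    B_k = B_qkv[n_q * d : (n_q + n_kv) * d]
    B_v = B_qkv[(n_q + n_kv) * d :]
    # A_qkv is simply replicated: one copy per projection.
    return {
        "q_proj": (A_qkv.copy(), B_q),
        "k_proj": (A_qkv.copy(), B_k),
        "v_proj": (A_qkv.copy(), B_v),
    }
```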