LoRA (Low-Rank Adaptation)#
NeMo RL supports LoRA for parameter-efficient fine-tuning. LoRA reduces the number of trainable parameters by learning low-rank update matrices for selected linear layers while keeping the base model frozen. This lowers memory usage and checkpoint size, and enables serving many task-specific adapters on top of a single shared base model.
This page is the single reference for LoRA in NeMo RL. It covers backend support, a side-by-side schema comparison, backend-specific config examples, and links to the algorithm guides and checkpoint-export tooling.
For the theory behind the method, see LoRA: Low-Rank Adaptation of Large Language Models.
Backend Support#
LoRA is implemented on two training backends, each with its own config schema:
Backend |
Config path |
Notes |
|---|---|---|
DTensor (Automodel) |
|
Requires DTensor v2 ( |
Megatron Core |
|
Requires |
LoRA is supported across the SFT, GRPO, and DPO algorithms on both backends.
Note
On the DTensor backend, Triton-optimized LoRA kernels are only used in the DTensor v2 path.
Automodel does not support Triton kernels when tensor_parallel_size > 1, so set
use_triton: false in that case (see the parameter details below).
Schema Comparison#
The two backends share most of their fields but differ in a few backend-specific options. The table below maps equivalent fields and highlights the differences.
Concept |
DTensor ( |
Megatron ( |
|---|---|---|
Enable LoRA |
|
|
Modules to adapt |
|
|
Modules to skip |
|
|
Adapt all linear layers |
|
(implicit: empty |
Rank (r) |
|
|
Scaling factor |
|
|
Dropout probability |
|
|
Dropout position |
|
|
A-matrix init |
|
|
B-matrix init |
(always zero) |
|
Optimized kernels |
|
— |
Adapter dtype |
(follows base layer) |
|
Experimental A2A comm |
— |
|
The effective learning-rate multiplier for the adapter is alpha / dim on both backends.
DTensor Configuration#
LoRA settings live under policy.dtensor_cfg.lora_cfg:
policy:
dtensor_cfg:
_v2: true # LoRA requires DTensor v2
lora_cfg:
enabled: False # Set to True to enable LoRA fine-tuning
target_modules: [] # List of module names to apply LoRA
exclude_modules: [] # List of module names to exclude from LoRA
match_all_linear: true # Apply LoRA to all linear layers
dim: 8 # LoRA rank (r): controls adaptation capacity
alpha: 32 # LoRA scaling factor (effective lr = alpha/dim)
dropout: 0.0 # Dropout probability for LoRA layers
dropout_position: "post" # Dropout position: "pre" or "post"
lora_A_init: "xavier" # Initialization method: "xavier" or "uniform"
use_triton: true # Use Triton-optimized kernels (DTensor v2 path)
DTensor Parameter Details#
enabled(bool): Whether to enable LoRA training.target_modules(list): Specific module names to apply LoRA. Empty withmatch_all_linear=trueapplies to all linear layers.exclude_modules(list): Module names to exclude from LoRA.match_all_linear(bool): Whentrue, applies LoRA to all linear layers (overridestarget_modules).dim(int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64.alpha(int): LoRA scaling factor. Effective learning rate multiplier =alpha/dim. Typical: 16, 32, 64.dropout(float): Dropout probability for regularization.dropout_position(str): Apply dropout before ("pre") or after ("post") LoRA.lora_A_init(str): Initialization method for the LoRA A matrix ("xavier"or"uniform"). The B matrix is always initialized to zero.use_triton(bool): Use Triton-optimized kernels for better performance. Used for DTensor v2 only. Note: Automodel does not support Triton for TP > 1. Set tofalsewhentensor_parallel_size > 1to avoid compatibility issues.
Megatron Configuration#
LoRA settings live under policy.megatron_cfg.peft:
policy:
megatron_cfg:
peft:
enabled: false # Set to True to enable LoRA fine-tuning
target_modules: [] # List of module names to apply LoRA, defaults to all linear layers
exclude_modules: [] # List of module names not to apply LoRA
dim: 32 # LoRA rank (r): controls adaptation capacity
alpha: 32 # LoRA scaling factor (effective lr = alpha/dim)
dropout: 0.0 # Dropout probability for LoRA layers
dropout_position: "pre" # Dropout position: "pre" or "post"
lora_A_init_method: "xavier" # Initialization method for lora A: "xavier" or "uniform"
lora_B_init_method: "zero" # Initialization method for lora B: "zero"
a2a_experimental: false # Enables the experimental All-to-All (A2A) communication strategy
lora_dtype: None # Adapter weights dtype
Megatron Parameter Details#
enabled(bool): Whether to enable LoRA training.target_modules(list): Specific module names to apply LoRA. Defaults to all linear layers if the list is left empty. Example:['linear_qkv', 'linear_proj', 'linear_fc1', 'linear_fc2'].linear_qkv: Apply LoRA to the fused linear layer used for query, key, and value projections in self-attention.linear_proj: Apply LoRA to the linear layer used for projecting the output of self-attention.linear_fc1: Apply LoRA to the first fully-connected layer in MLP.linear_fc2: Apply LoRA to the second fully-connected layer in MLP.
Target modules can also contain wildcards. For example, you can specify
target_modules=['*.layers.0.*.linear_qkv', '*.layers.1.*.linear_qkv']to add LoRA to onlylinear_qkvon the first two layers.exclude_modules(list, optional): A list of module names not to apply LoRA. It will match allnn.Linear&nn.Linear-adjacent modules whose name does not match any string inexclude_modules. If used, it requirestarget_modulesto be an empty list orNone.dim(int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64.alpha(int): LoRA scaling factor. Effective learning rate multiplier =alpha/dim. Typical: 16, 32, 64.dropout(float): Dropout probability for regularization, defaults to 0.0.dropout_position(str): Apply dropout before ("pre") or after ("post") LoRA.lora_A_init_method(str): Initialization method forlora_A(choices:['xavier', 'uniform']), defaults toxavier.lora_B_init_method(str): Initialization method for the low-rank matrix B. Defaults to"zero".a2a_experimental(bool): Enables the experimental All-to-All (A2A) communication strategy. Defaults toFalse.lora_dtype(torch.dtype): Adapter weights dtype. By default it followsorig_linear’s dtype, but for quantized weights (e.g. 4-bit) it must be specified explicitly.
Usage by Algorithm#
SFT#
The config uses the DTensor backend by default, so DTensor LoRA only requires enabling the flag:
uv run examples/run_sft.py policy.dtensor_cfg.lora_cfg.enabled=true
To use the Megatron backend, switch backends and enable the PEFT block:
uv run examples/run_sft.py \
--config examples/configs/sft.yaml \
policy.dtensor_cfg.enabled=false \
policy.megatron_cfg.enabled=true \
policy.megatron_cfg.peft.enabled=true
See the SFT guide for the full SFT workflow.
GRPO#
GRPO supports LoRA on both backends. Enable the DTensor adapter with:
uv run examples/run_grpo.py policy.dtensor_cfg.lora_cfg.enabled=true
The DTensor GRPO LoRA path uses a merge-weight approach: during generation, LoRA adapter weights are merged into the base linear weights. This improves performance at the cost of a small train/inference mismatch that we consider acceptable. If you require strict train/inference parity, use the split-weight variant branch, which may trade off some performance. For a comparison between merge-weight and split-weight, see PR 1797: Support lora in dtensor grpo workflow by merging weight.
See the GRPO guide for the full GRPO workflow.
DPO#
DPO fully supports LoRA on both the DTensor and Megatron backends, using the same
lora_cfg / peft config blocks as SFT and GRPO. There is no dedicated DPO LoRA recipe;
enable it on an existing DPO config via an override. For the DTensor backend:
uv run examples/run_dpo.py policy.dtensor_cfg.lora_cfg.enabled=true
For the Megatron backend:
uv run examples/run_dpo.py \
--config examples/configs/dpo.yaml \
policy.dtensor_cfg.enabled=false \
policy.megatron_cfg.enabled=true \
policy.megatron_cfg.peft.enabled=true
See the DPO guide for the full DPO workflow.
Example Recipes#
Ready-to-run LoRA recipes live under examples/configs/recipes/llm/:
Recipe |
Algorithm |
Backend |
|---|---|---|
SFT |
DTensor |
|
SFT |
Megatron |
|
SFT |
DTensor |
|
GRPO |
DTensor |
|
GRPO |
Megatron |
|
GRPO |
DTensor |
|
GRPO |
Megatron |
DPO LoRA is supported on both backends but ships no dedicated recipe — enable it on an existing DPO config with the overrides shown in Usage by Algorithm → DPO.
Exporting a LoRA Checkpoint to Hugging Face Format#
After training with LoRA on the Megatron backend, the convert_lora_to_hf.py script supports
two export modes:
Merged: fold the adapter into the base model and export a single standalone Hugging Face checkpoint for inference or evaluation.
Adapter-only: export only the adapter weights in Hugging Face PEFT format, keeping the base model separate (e.g. for use with vLLM’s LoRA support).
See the Checkpointing documentation for full usage details and examples.