Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with NeMo AutoModel

Introduction

Pretrained language models are general-purpose: they know a lot about language but nothing about your particular domain, terminology, or task. Fine-tuning bridges that gap. You continue training the model on your own examples so that it produces answers that are accurate and relevant for your use case, without the cost of training a model from scratch. The result is a model optimized for your data that you can evaluate, publish, and deploy. This guide walks you through that process end to end with NeMo AutoModel, from installation through training, evaluation, and deployment, using Meta Llama 3.2 1B and the SQuAD v1.1 dataset as a running example.

NeMo AutoModel supports two fine-tuning modes:

Supervised Fine-Tuning (SFT) updates all model parameters. Use SFT when you need maximum accuracy and have sufficient compute.
Parameter-Efficient Fine-Tuning (PEFT) using LoRA freezes the base model and trains small low-rank adapters. PEFT reduces trainable parameters to less than 1% of the original model, lowering memory and storage costs.

Workflow Overview

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ 1. Install   │--->│ 2. Configure │--->│  3. Train    │--->│ 4. Inference │--->│ 5. Evaluate  │--->│ 6. Publish   │--->│  7. Deploy   │
│              │    │              │    │              │    │              │    │              │    │  (optional)  │    │  (optional)  │
│ pip install  │    │ YAML config  │    │ automodel CLI│    │ HF generate  │    │ Val loss +   │    │ HF Hub       │    │ vLLM serving │
│ or Docker    │    │ Choose SFT   │    │ or torchrun  │    │ API          │    │ lm-eval-     │    │ upload       │    │              │
│              │    │ or PEFT      │    │              │    │              │    │ harness      │    │              │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Step	Section	SFT	PEFT
1. Install	Install NeMo AutoModel	Same	Same
2. Configure	Configure Your Training Recipe	YAML without `peft:` section	YAML with `peft:` section
3. Train	Fine-Tune the Model	Same command for both modes	Same command for both modes
4. Inference	Run Inference	Load consolidated checkpoint directly	Load base model + adapter
5. Evaluate	Evaluate the Fine-Tuned Model	Validation loss during training; lm-eval-harness post-training	Same
6. Publish	Publish to HF Hub	Upload `model/consolidated/`	Upload `model/` (adapter only)
7. Deploy	Deploy with vLLM	`vllm.LLM(model=...)`	`vLLMHFExporter` with `--lora-model`

Install NeMo AutoModel

$ pip3 install nemo-automodel

Alternatively, if you run into dependency or driver issues, use the pre-built Docker container:

$ docker pull nvcr.io/nvidia/nemo-automodel:26.06.00
$ docker run --gpus all -it --rm --shm-size=8g -v $(pwd)/checkpoints:/tmp/checkpoints/ nvcr.io/nvidia/nemo-automodel:26.06.00

Docker containers are ephemeral — files written inside the container are lost when it stops. The -v flag in the docker run command above bind-mounts a local checkpoints/ directory into the container so that saved checkpoints persist across runs. For more details, see Save Checkpoints When Using Docker.

For the full set of installation methods, see the installation guide.

Configure Your Training Recipe

Training is configured through a YAML config file with three required sections — model, dataset, and step_scheduler — plus an optional peft section. The sections below walk through each one. For the complete copy-pastable file, see Full Config YAML.

Under the hood, both SFT and PEFT are executed by a recipe: a self-contained Python class that wires together model loading, dataset preparation, training, checkpointing, and logging. The fine-tuning recipe is TrainFinetuneRecipeForNextTokenPrediction. The config file tells the recipe what to build; the recipe decides how to build it.

How the Config System Works

NeMo AutoModel configs use a convention borrowed from Hydra: the special _target_ key tells the framework which Python class or function to call, and every other key in the same YAML block is passed as a keyword argument to that call. For example:

optimizer:
  _target_: torch.optim.Adam
  lr: 1.0e-5
  weight_decay: 0

is equivalent to writing this Python code:

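import torch
from torch.optim import Adam

model = torch.nn.Linear(4, 2)  # placeholder; the recipe passes the real model's parameters
optimizer = Adam(model.parameters(), lr=1.0e-5, weight_decay=0)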

The _target_ value is a dotted Python import path: the same string you would use in an import statement. The framework resolves it at runtime by importing the module and looking up the attribute. This means you can point _target_ at any class constructor or factory function, and the remaining keys become its arguments.

To discover which parameters a section accepts, look up the Python signature of its _target_. For instance, torch.optim.Adam accepts lr, betas, eps, and weight_decay — those are the keys you can set in the YAML.
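
The lookup described above can also be done programmatically. The following is a minimal sketch, not part of NeMo AutoModel: `list_parameters` is a hypothetical helper that only handles simple `module.attribute` paths, and it uses `json.dumps` as a standard-library stand-in so that it runs without PyTorch installed.

```python
import importlib
import inspect


def list_parameters(dotted_path: str) -> list[str]:
    """Return the parameter names accepted by the callable at `dotted_path`."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    target = getattr(importlib.import_module(module_name), attr_name)
    return list(inspect.signature(target).parameters)


# Standard-library stand-in; with PyTorch installed you could pass
# "torch.optim.Adam" instead and read off lr, betas, eps, weight_decay, ...
print(list_parameters("json.dumps"))
```

Each name printed is a key you could set next to `_target_` in the corresponding YAML block.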

From YAML to running code. Here is the path a config takes through the framework:

finetune_config.yaml
        │
        ▼
  ┌──────────────┐     load_yaml_config() parses the file into
  │  ConfigNode  │◄─── a tree of ConfigNode objects, one per
  └──────┬───────┘     YAML section.
         │
         ▼
  ┌──────────────┐     The recipe's setup() method reads
  │   Recipe     │◄─── each section from the ConfigNode tree
  │   setup()    │     and passes it to the matching builder.
  └──────┬───────┘
         │
    ┌────┴─────────────────────────────────┐
    ▼            ▼            ▼            ▼
build_model  build_optimizer build_dataloader build_loss_module ...
    │            │            │            │
    ▼            ▼            ▼            ▼
cfg.model     cfg.optimizer cfg.dataset   cfg.loss_fn
 .instantiate() .instantiate() .instantiate() .instantiate()
    │            │            │            │
    ▼            ▼            ▼            ▼
 Resolves      Resolves     Resolves     Resolves
 _target_,     _target_,    _target_,    _target_,
 calls it      calls it     calls it     calls it
 with kwargs   with kwargs  with kwargs  with kwargs

Each builder function calls .instantiate() on its config section. .instantiate() does two things:

Resolves _target_ — imports the Python path and obtains the callable (class or function).
Calls it — passes every other key in the section as a keyword argument.

Nested _target_ blocks (like collate_fn inside dataloader) are recursively instantiated the same way.
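
To make the two steps and the recursion concrete, here is a small sketch of the same idea over plain dictionaries. It is an illustration only, not the framework's ConfigNode implementation: `resolve_target` and `instantiate` are hypothetical names, and the example config uses standard-library targets so that it runs anywhere.

```python
import importlib
from typing import Any


def resolve_target(path: str) -> Any:
    """Import the longest module prefix of `path`, then walk the remaining attributes."""
    parts = path.split(".")
    for split in range(len(parts) - 1, 0, -1):
        try:
            obj: Any = importlib.import_module(".".join(parts[:split]))
        except ModuleNotFoundError:
            continue  # this prefix is not a module; try a shorter one
        for name in parts[split:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"cannot resolve {path!r}")


def instantiate(node: Any) -> Any:
    """Recursively build objects from dicts that carry a `_target_` key."""
    if isinstance(node, dict):
        # Build children first so nested `_target_` blocks become objects.
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        if "_target_" in node:
            return resolve_target(node["_target_"])(**kwargs)
        return kwargs
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    return node


# Illustrative config: an outer target with one nested target.
config = {
    "_target_": "types.SimpleNamespace",
    "name": "demo",
    "interval": {"_target_": "datetime.timedelta", "minutes": 5},
}
built = instantiate(config)
print(built.name, built.interval.total_seconds())
```

Trying the longest module prefix first is what lets a path such as `package.Class.method` resolve: the import succeeds at `package`, and the remaining parts are looked up as attributes.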

The recipe key. Every config file includes a top-level recipe key that tells the CLI which recipe class to run. You can write it as a short name or as a fully-qualified Python path — both resolve to the same class:

# Short name (the CLI looks up the class automatically)
recipe: TrainFinetuneRecipeForNextTokenPrediction

# Fully-qualified path (used as-is)
recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction

The short name form is a convenience — the CLI scans all recipe modules under nemo_automodel.recipes and matches the bare class name. If you invoke the recipe script directly with torchrun instead of the automodel CLI, the recipe key is not required because the script itself is the recipe.
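
A lookup of this kind can be sketched with `pkgutil`. This is an assumption about the general technique, not the CLI's actual code: `find_class` is a hypothetical helper, and the demonstration scans the standard-library `json` package rather than `nemo_automodel.recipes`.

```python
import importlib
import inspect
import json
import pkgutil


def find_class(package_name: str, class_name: str) -> type:
    """Scan every module under `package_name` for a class named `class_name`."""
    package = importlib.import_module(package_name)
    for info in pkgutil.walk_packages(package.__path__, prefix=package_name + "."):
        if info.name.endswith("__main__"):
            continue  # never import entry-point modules while scanning
        module = importlib.import_module(info.name)
        candidate = getattr(module, class_name, None)
        # Skip re-exports: accept the class only in the module that defines it.
        if inspect.isclass(candidate) and candidate.__module__ == info.name:
            return candidate
    raise LookupError(f"no class named {class_name!r} under {package_name!r}")


decoder_cls = find_class("json", "JSONDecoder")
print(decoder_cls.__module__, decoder_cls.__qualname__)
```

A scan like this imports every module it visits, which is one reason a fully-qualified path can be preferable when import time or side effects matter.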

Not every section uses _target_. Some sections like step_scheduler, distributed, and checkpoint are plain key-value groups consumed directly by the recipe — they control training schedule, parallelism strategy, and checkpoint behavior without instantiating a Python object.

Model

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

Key	Role
`_target_`	Points to `NeMoAutoModelForCausalLM.from_pretrained` — a factory method that downloads (or loads from cache) a pretrained Hugging Face model and wraps it with NeMo distributed-training support.
`pretrained_model_name_or_path`	A keyword argument to `from_pretrained`. Any argument that `from_pretrained` accepts can be added here (e.g., `cache_dir`, `torch_dtype`).

This guide uses Meta Llama 3.2 1B as a running example. Replace pretrained_model_name_or_path with any supported Hugging Face model ID.

About Llama 3.2 1B

Llama is a family of decoder-only transformer models developed by Meta. The 1B variant is a compact model suitable for research and edge deployment, featuring RoPE positional embeddings, grouped-query attention (GQA), and SwiGLU activations.

Accessing Gated Models

Some Hugging Face models are gated. If the model page shows a “Request access” button:

Log in with your Hugging Face account and accept the license.
Ensure the token you use (from huggingface-cli login or HF_TOKEN) belongs to the approved account.

Pulling a gated model without an authorized token triggers a 403 error.

Dataset

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad  # HF-Hub ID used to pull the dataset
  split: train

validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: validation

Key	Role
`_target_`	Points to `make_squad_dataset` — a factory function that downloads the SQuAD dataset, tokenizes it, and returns a `torch.utils.data.Dataset`. To use a different dataset, change `_target_` to a different factory function (see Integrate Your Own Text Dataset).
`dataset_name`, `split`	Keyword arguments passed to `make_squad_dataset`. Each dataset factory defines its own parameters — check the function signature to see what’s available.

This guide uses SQuAD v1.1 as a running example. Swap the dataset by changing _target_ and the dataset arguments — see Integrate Your Own Text Dataset and Dataset Overview: LLM, VLM, and Retrieval Datasets.

About SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and an answer span. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.

Example:

{
    "context": "Architecturally, the school has a Catholic character. ...",
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": { "text": ["Saint Bernadette Soubirous"], "answer_start": [515] }
}
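
The record layout above can be checked and turned into a prompt with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the record is a shortened made-up example, and the `Context / Question / Answer` template mirrors the inference prompt used later in this guide rather than the exact formatting inside `make_squad_dataset`, which is not shown here.

```python
def format_squad_prompt(example: dict) -> tuple[str, str]:
    """Build a (prompt, answer) pair from one SQuAD-style record."""
    prompt = (
        f"Context: {example['context']}\n\n"
        f"Question: {example['question']}\n\n"
        "Answer:"
    )
    return prompt, example["answers"]["text"][0]


def answer_span_is_valid(example: dict) -> bool:
    """Check that every answer_start offset points at its answer text."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start : start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Shortened, made-up record that follows the SQuAD field layout.
context = "The golden statue atop the dome depicts the Virgin Mary."
record = {
    "context": context,
    "question": "Whom does the statue atop the dome depict?",
    "answers": {
        "text": ["the Virgin Mary"],
        "answer_start": [context.index("the Virgin Mary")],
    },
}

prompt, answer = format_squad_prompt(record)
print(prompt)
print("Expected answer:", answer)
```

The span check is useful when you build your own dataset in the same layout, because a character offset that drifts by even one position silently corrupts the training target.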

PEFT (Optional)

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*.proj"  # glob pattern matching linear layer FQNs
  dim: 8                    # low-rank dimension of the adapters
  alpha: 32                 # scaling factor for learned weights

Key	Role
`_target_`	Points to `PeftConfig` — a dataclass that describes which layers to adapt and how. Unlike the model and dataset sections, this instantiation produces a config object, not the adapter itself. The recipe passes the resulting `PeftConfig` into `build_model`, which applies LoRA adapters to the model.
`target_modules`	A glob pattern matched against fully-qualified layer names (e.g. `"*.proj"` matches every layer whose name ends in `proj`).
`dim`	The low-rank dimension r — controls adapter capacity. Larger values learn more but use more memory.
`alpha`	Scaling factor applied to the adapter output (`alpha / dim`). Higher values give adapters more influence during training.

Including a peft: section enables LoRA fine-tuning. Remove it entirely to run SFT instead — see Switch Between SFT and PEFT.
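
To see what `dim` and `alpha` mean numerically, the arithmetic can be worked through for a single layer. This sketch assumes the standard LoRA formulation (two low-rank matrices per adapted layer, output scaled by `alpha / dim`) and a hypothetical 2048 × 2048 projection; the function names are illustrative, not library API.

```python
def lora_trainable_params(in_features: int, out_features: int, dim: int) -> int:
    """Parameters in one adapter: A is (dim x in_features), B is (out_features x dim)."""
    return dim * (in_features + out_features)


def lora_scale(alpha: float, dim: int) -> float:
    """Multiplier applied to the adapter output."""
    return alpha / dim


in_features = out_features = 2048  # hypothetical square projection layer
frozen = in_features * out_features
adapter = lora_trainable_params(in_features, out_features, dim=8)

print(f"frozen weights:   {frozen:,}")
print(f"adapter weights:  {adapter:,} ({adapter / frozen:.2%} of the layer)")
print(f"adapter scale:    {lora_scale(alpha=32, dim=8)}")
```

Doubling `dim` doubles the adapter size but halves the scale for a fixed `alpha`, which is why the two values are usually tuned together.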

QLoRA (Quantized Low-Rank Adaptation)

If GPU memory is a constraint, QLoRA combines LoRA with 4-bit NormalFloat (NF4) quantization to reduce memory usage by up to 75% compared to full-parameter SFT in 16-bit precision, while maintaining comparable quality to standard LoRA.

To enable QLoRA, add a quantization: section alongside the peft: section in your config. Note two differences from the standard PEFT config above: target_modules uses the broader "*_proj" pattern to apply LoRA to all projection layers (wider coverage compensates for precision loss from 4-bit weights), and dim is increased from 8 to 16 for additional adapter capacity.

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*_proj"  # broader glob than "*.proj" to cover all projection layers
  dim: 16                   # LoRA rank (higher than default to offset quantization)
  alpha: 32                 # scaling factor
  dropout: 0.1              # LoRA dropout rate

quantization:
  load_in_4bit: True                   # enable 4-bit quantization
  load_in_8bit: False                  # use 4-bit, not 8-bit
  bnb_4bit_compute_dtype: bfloat16     # compute dtype
  bnb_4bit_use_double_quant: True      # double quantization for extra savings
  bnb_4bit_quant_type: nf4             # NormalFloat quantization type
  bnb_4bit_quant_storage: bfloat16     # storage dtype for quantized weights
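
One way to see where a figure like 75% can come from is to compare the storage needed for the base weights alone at 16 bits and at 4 bits per parameter. The sketch below is back-of-the-envelope arithmetic: the parameter count is an approximate value assumed for a 1B-class model, and the estimate ignores quantization constants, activations, optimizer state, and the adapters themselves.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 1.24e9  # approximate parameter count, assumed for illustration

bf16_gib = weight_memory_gib(num_params, bits_per_param=16)
nf4_gib = weight_memory_gib(num_params, bits_per_param=4)
saving = 1 - nf4_gib / bf16_gib

print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"NF4 weights:  {nf4_gib:.2f} GiB")
print(f"reduction:    {saving:.0%}")
```

Actual savings in a training run depend on sequence length, batch size, and optimizer settings, so treat the ratio as an upper bound on the weight-storage component only.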

Training Schedule

step_scheduler:
  num_epochs: 1     # Will train over the dataset once.

Unlike the sections above, step_scheduler has no _target_ — it is not instantiated into a Python object. Instead, the recipe reads its keys directly to control the training loop (how many epochs to run, when to checkpoint, when to validate). This is typical of sections that configure behavior rather than components.

All other settings (distributed strategy, optimizer, checkpointing, logging) use sensible defaults. See the Full Configuration Reference to customize them.

Most example recipes use bf16 training by default to save memory and improve throughput. If you are running a long fine-tuning job, especially full-parameter SFT, and need higher-precision optimizer state, configure it explicitly instead of assuming it from the mixed-precision compute policy. See the mixed-precision training guide for the recommended Transformer Engine (TE) and torch AdamW patterns.

Full Config YAML


Save as finetune_config.yaml. This config runs PEFT (LoRA). To run SFT instead, remove the peft: section. For production-ready examples, see the hosted configs: Llama 3.2 1B SFT and Llama 3.2 1B PEFT.

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*.proj"
  dim: 8
  alpha: 32

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train

validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: validation

step_scheduler:
  num_epochs: 1

Fine-Tune the Model

You can run the recipe using the AutoModel CLI or directly with torchrun (advanced).

$ automodel --nproc-per-node=8 finetune_config.yaml

The --nproc-per-node=8 flag specifies the number of GPUs per node. Adjust as needed (for a single GPU, omit the --nproc-per-node option).

Invoke the Recipe Script Directly (Advanced)

Alternatively, you can invoke the recipe script directly using torchrun, as shown below.

$ torchrun --nproc-per-node=8 nemo_automodel/recipes/llm/train_ft.py -c finetune_config.yaml

Sample Output

Running the recipe with the automodel CLI or by invoking the recipe script directly produces a log similar to the following:

$ automodel finetune_config.yaml
INFO:nemo_automodel.cli.app:Config: finetune_config.yaml
INFO:nemo_automodel.cli.app:Recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
INFO:nemo_automodel.cli.app:Launching job interactively (local)
cfg-path: finetune_config.yaml
INFO:root:step 4 | epoch 0 | loss 1.5514 | grad_norm 102.0000 | mem: 11.66 GiB | tps 6924.50
INFO:root:step 8 | epoch 0 | loss 0.7913 | grad_norm 46.2500 | mem: 14.58 GiB | tps 9328.79
Saving checkpoint to checkpoints/epoch_0_step_10
INFO:root:step 12 | epoch 0 | loss 0.4358 | grad_norm 23.8750 | mem: 15.48 GiB | tps 9068.99
INFO:root:step 16 | epoch 0 | loss 0.2057 | grad_norm 12.9375 | mem: 16.47 GiB | tps 9148.28
INFO:root:step 20 | epoch 0 | loss 0.2557 | grad_norm 13.4375 | mem: 12.35 GiB | tps 9196.97
Saving checkpoint to checkpoints/epoch_0_step_20
INFO:root:[val] step 20 | epoch 0 | loss 0.2469

Each log line reports the current loss, gradient norm, peak GPU memory, and tokens per second (TPS). Small fluctuations between steps (e.g., 0.2057 to 0.2557 above) are normal — look at the overall downward trend rather than individual values.
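
To judge the trend rather than individual values, the loss can be pulled out of the log and smoothed. The following is a minimal sketch that assumes the exact line format shown in the sample output above; `parse_losses` and `moving_average` are hypothetical helpers, not part of NeMo AutoModel.

```python
import re
from statistics import mean

# Matches both training lines and "[val]" lines from the sample log format.
LOG_PATTERN = re.compile(
    r"(?P<val>\[val\] )?step (?P<step>\d+) \| epoch (?P<epoch>\d+) \| loss (?P<loss>[\d.]+)"
)


def parse_losses(lines):
    """Split log lines into (step, loss) lists for training and validation."""
    train, val = [], []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # e.g. "Saving checkpoint to ..." lines
        record = (int(match["step"]), float(match["loss"]))
        (val if match["val"] else train).append(record)
    return train, val


def moving_average(values, window=3):
    """Trailing moving average; early entries use however many values exist."""
    return [mean(values[max(0, i - window + 1) : i + 1]) for i in range(len(values))]


log_lines = [
    "INFO:root:step 4 | epoch 0 | loss 1.5514 | grad_norm 102.0000 | mem: 11.66 GiB | tps 6924.50",
    "INFO:root:step 8 | epoch 0 | loss 0.7913 | grad_norm 46.2500 | mem: 14.58 GiB | tps 9328.79",
    "Saving checkpoint to checkpoints/epoch_0_step_10",
    "INFO:root:step 12 | epoch 0 | loss 0.4358 | grad_norm 23.8750 | mem: 15.48 GiB | tps 9068.99",
    "INFO:root:step 16 | epoch 0 | loss 0.2057 | grad_norm 12.9375 | mem: 16.47 GiB | tps 9148.28",
    "INFO:root:step 20 | epoch 0 | loss 0.2557 | grad_norm 13.4375 | mem: 12.35 GiB | tps 9196.97",
    "INFO:root:[val] step 20 | epoch 0 | loss 0.2469",
]

train, val = parse_losses(log_lines)
smoothed = moving_average([loss for _, loss in train])
print("train:", train)
print("val:", val)
print("smoothed:", [round(x, 4) for x in smoothed])
```

On the sample log, the raw loss rises between steps 16 and 20 while the smoothed series keeps falling, which is the pattern the paragraph above describes as normal.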

Checkpoint Contents

Checkpoints are saved as Hugging Face-compatible safetensors.

By default, SFT writes sharded model weights plus a generated model/consolidate.sh helper; run the helper after training to create model/consolidated/ for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set save_consolidated: final to export consolidated HF weights only for the final checkpoint. Use save_consolidated: every (or legacy true) only when you intentionally want inline HF export at every checkpoint save.

PEFT checkpoints contain only the adapter weights (~MBs instead of GBs) and are saved directly under model/. They do not use model/consolidate.sh; at inference time you must load the original base model and apply the adapter on top.

This distinction affects every downstream step (inference, publishing, deployment).

Checkpoint Directory Structure

SFT checkpoint:

$ tree checkpoints/epoch_0_step_10/
checkpoints/epoch_0_step_10/
├── config.yaml
├── dataloader.pt
├── model
│   ├── consolidate.sh
│   ├── shard-00001-model-00001-of-00001.safetensors
│   └── shard-00002-model-00001-of-00001.safetensors
├── optim
│   ├── __0_0.distcp
│   └── __1_0.distcp
├── rng.pt
└── step_scheduler.pt

3 directories, 9 files

PEFT checkpoint:

$ tree checkpoints/epoch_0_step_10/
checkpoints/epoch_0_step_10/
├── dataloader.pt
├── config.yaml
├── model
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── automodel_peft_config.json
├── optim
│   ├── __0_0.distcp
│   └── __1_0.distcp
├── rng.pt
└── step_scheduler.pt

2 directories, 8 files
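
Because the two layouts call for different loading code, a script that handles arbitrary checkpoints may want to tell them apart first. The sketch below relies only on the file names shown in the listings above; `checkpoint_kind` is a hypothetical helper and the returned labels are made up for illustration.

```python
import tempfile
from pathlib import Path


def checkpoint_kind(ckpt_dir) -> str:
    """Classify a checkpoint directory by the files under its model/ folder."""
    model_dir = Path(ckpt_dir) / "model"
    if (model_dir / "adapter_model.safetensors").is_file():
        return "peft_adapter"      # load the base model, then apply the adapter
    if (model_dir / "consolidated").is_dir():
        return "sft_consolidated"  # load model/consolidated/ directly
    if (model_dir / "consolidate.sh").is_file():
        return "sft_sharded"       # run model/consolidate.sh first
    return "unknown"


# Demonstration on a throwaway directory that imitates a sharded SFT checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_dir.mkdir()
    (model_dir / "consolidate.sh").write_text("#!/bin/bash\n")
    print(checkpoint_kind(tmp))
```

The adapter check comes first so that a directory is never mistaken for a full model when adapter files are present.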

Run Inference

Inference uses the Hugging Face generate API. Because exported SFT checkpoints are self-contained while PEFT checkpoints store only adapter weights (see Checkpoint Contents), the loading procedure differs between the two modes.

SFT Inference

If save_consolidated: false, first run the generated helper for the checkpoint you want to load:

$ bash checkpoints/epoch_0_step_10/model/consolidate.sh

The exported SFT checkpoint at model/consolidated/ is a complete Hugging Face model and can be loaded directly:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_path = "checkpoints/epoch_0_step_10/model/consolidated"
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
model = AutoModelForCausalLM.from_pretrained(ckpt_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = (
    "Context: Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue of Christ "
    "with arms upraised with the legend 'Venite Ad Me Omnes'.\n\n"
    "Question: What is atop the Main Building?\n\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

PEFT Inference

PEFT adapters must be loaded on top of the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

adapter_path = "checkpoints/epoch_0_step_10/model/"
model = PeftModel.from_pretrained(model, adapter_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = (
    "Context: Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue of Christ "
    "with arms upraised with the legend 'Venite Ad Me Omnes'.\n\n"
    "Question: What is atop the Main Building?\n\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Evaluate the Fine-Tuned Model

During Training: Validation Loss

The recipe automatically computes validation loss at the interval set by val_every_steps. Look for [val] lines in the training log:

INFO:root:[val] step 20 | epoch 0 | loss 0.2469

A decreasing validation loss across checkpoints indicates the model is learning. If validation loss plateaus or increases while training loss continues to drop, the model may be overfitting — consider stopping earlier or reducing the learning rate.
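
The plateau-or-increase rule can be written down as a simple patience check over the validation losses you collect from the log. This is a generic early-stopping heuristic offered as a sketch, not a feature of the recipe; `should_stop_early`, `patience`, and `min_delta` are hypothetical names chosen for illustration.

```python
def should_stop_early(val_losses, patience=2, min_delta=0.0) -> bool:
    """Return True when the last `patience` evaluations failed to beat the earlier best.

    An evaluation counts as an improvement only if it is lower than the best
    earlier loss by more than `min_delta`.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)


still_improving = [0.90, 0.50, 0.30, 0.25]
overfitting = [0.90, 0.50, 0.30, 0.31, 0.35]

print(should_stop_early(still_improving))
print(should_stop_early(overfitting))
```

With frequent validation (a small val_every_steps), a larger `patience` helps avoid reacting to noise between neighboring evaluations.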

After Training: lm-eval-harness

For task-specific benchmarks (e.g., MMLU, GSM8K, HellaSwag accuracy), use lm-eval-harness with the fine-tuned checkpoint. For SFT runs with save_consolidated: false, run bash checkpoints/epoch_0_step_20/model/consolidate.sh before pointing evaluation at model/consolidated/:

$ pip install lm-eval
$ 
$ # SFT checkpoint (using vLLM backend for faster evaluation)
$ lm_eval --model vllm \
>   --model_args pretrained=checkpoints/epoch_0_step_20/model/consolidated/ \
>   --tasks hellaswag \
>   --batch_size auto
$ 
$ # PEFT adapter (using Hugging Face backend with built-in PEFT support)
$ lm_eval --model hf \
>   --model_args pretrained=meta-llama/Llama-3.2-1B,peft=checkpoints/epoch_0_step_20/model/ \
>   --tasks hellaswag \
>   --batch_size auto

The SFT example uses the vllm backend for faster evaluation (requires pip install vllm; see Deploy with vLLM for setup details). The PEFT example uses the hf backend with lm-eval’s built-in PEFT support to load the adapter on top of the base model.

Run lm-eval-harness on the base model before fine-tuning to establish a baseline, then compare against the fine-tuned checkpoint.

Publish to the Hugging Face Hub

Fine-tuned checkpoints and PEFT adapters are stored in Hugging Face-native format and can be uploaded directly to the Hub. For SFT runs with save_consolidated: false, upload model/consolidated/ after running the generated consolidation helper.

Install the Hugging Face Hub library (if not already installed):

$ pip3 install huggingface_hub

Then log in to your Hugging Face account:

$ huggingface-cli login

Upload:

SFT checkpoint:

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="checkpoints/epoch_0_step_10/model/consolidated",
    repo_id="your-username/llama3.2_1b-finetuned-squad",
    repo_type="model",
)

PEFT adapter:

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="checkpoints/epoch_0_step_10/model",
    repo_id="your-username/llama3.2_1b-lora-squad",
    repo_type="model",
)

Once uploaded, load the checkpoint or adapter directly from the Hub:

SFT:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-username/llama3.2_1b-finetuned-squad")

PEFT:

from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model = PeftModel.from_pretrained(model, "your-username/llama3.2_1b-lora-squad")

Deploy with vLLM

vLLM is an efficient inference engine for production deployment of LLMs.

Make sure vLLM is installed (pip install vllm, or use an environment that includes it).

SFT Checkpoint with vLLM

If save_consolidated: false, run the generated model/consolidate.sh helper before serving from model/consolidated/.

from vllm import LLM, SamplingParams

llm = LLM(model="checkpoints/epoch_0_step_10/model/consolidated/", model_impl="transformers")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("Toronto is a city in Canada.", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")

>>> Generated text:  It is the capital of Ontario. Toronto is a global hub for cultural tourism. The City of Toronto

PEFT Adapter with vLLM

PEFT adapter serving uses the vLLMHFExporter class, which is provided by the nemo package — a separate dependency from nemo-automodel.

Install both packages before proceeding:

$ pip install nemo vllm

from nemo.export.vllm_hf_exporter import vLLMHFExporter

if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--model', required=True, type=str, help="Local path of the base model")
    parser.add_argument('--lora-model', required=True, type=str, help="Local path of the LoRA adapter")
    args = parser.parse_args()

    lora_model_name = "lora_model"

    exporter = vLLMHFExporter()
    exporter.export(model=args.model, enable_lora=True)
    exporter.add_lora_models(lora_model_name=lora_model_name, lora_model=args.lora_model)

    print("vLLM Output: ", exporter.forward(input_texts=["How are you doing?"], lora_model_name=lora_model_name))

Full Configuration Reference

This section documents all available config fields for the fine-tuning recipe. For the quick-start config, see Configure Your Training Recipe.

Switch Between SFT and PEFT

The peft: section controls which mode runs:

Mode	What to do in the YAML
PEFT (LoRA)	Include the `peft:` section as shown below.
SFT (full-parameter)	Remove/comment the `peft:` section entirely.

All other config sections remain the same for both modes.

Full Configuration


# Recipe
# Selects which recipe class runs the training loop.
# Use a short name (auto-discovered) or a fully-qualified Python path:
#   recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
recipe: TrainFinetuneRecipeForNextTokenPrediction

# Training Schedule
# Controls epoch count, batch sizes, and how often to checkpoint / validate.
# No _target_: these are plain values read directly by the recipe.
step_scheduler:
  grad_acc_steps: 4       # number of micro-batches accumulated before each optimizer
                          # step. Effective batch = grad_acc_steps × batch_size.
  ckpt_every_steps: 10    # save a checkpoint every N gradient steps
  val_every_steps: 10     # run the validation loop every N gradient steps
  num_epochs: 1           # how many full passes over the training dataset

# Process Group
# Initializes the PyTorch distributed process group.
# No _target_: consumed directly by the recipe.
# You normally would not need to tune this.
dist_env:
  backend: nccl           # communication backend: "nccl" (GPU, recommended) or "gloo" (CPU)
  timeout_minutes: 1      # timeout for collective operations; increase for large models
                          # that take longer to initialize

# Distributed Strategy
# Determines how model weights, data, and compute are split across GPUs.
# No _target_: consumed directly by the recipe.
# See "Distributed Training: TP, PP, CP, and EP" in Advanced Topics for details.
distributed:
  strategy: fsdp2         # parallelism strategy: "fsdp2" (recommended), "megatron_fsdp",
                          # or "ddp". FSDP2 shards parameters and optimizer states across
                          # the data-parallel group.
  dp_size: null           # data-parallel group size. null = auto-detect from
                          # world_size ÷ (tp_size × cp_size × pp_size).
  tp_size: 1              # tensor-parallel size: splits weight matrices across GPUs.
                          # Set to 2, 4, or 8 if the model doesn't fit on one GPU.
                          # Should divide evenly into the number of attention heads.
  cp_size: 1              # context-parallel size: splits the input sequence across GPUs.
                          # Increase for very long contexts (e.g. 32k+ tokens).
  sequence_parallel: false # when true, extends TP to also shard activations along
                          # the sequence dimension for additional memory savings

# Random Number Generator
# _target_ → StatefulRNG: a checkpointable RNG that ensures identical sequences
# across training restarts. Seed and ranked are kwargs to StatefulRNG().
rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111              # global random seed for reproducibility
  ranked: true            # when true, each GPU rank gets a unique RNG stream derived
                          # from the seed, so data shuffling differs per GPU

# Model
# _target_ → NeMoAutoModelForCausalLM.from_pretrained: downloads (or loads from
# cache) a pretrained HuggingFace model and wraps it with NeMo distributed-training
# support. Any from_pretrained kwarg is accepted (cache_dir, torch_dtype, etc.).
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# PEFT (remove / comment this entire section for full-parameter SFT)
# _target_ → PeftConfig: a dataclass describing which layers get LoRA adapters.
# The recipe passes this config into build_model(), which attaches adapters
# to the matching layers.
peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*.proj" # glob pattern matched against fully-qualified layer names;
                           # "*.proj" matches every layer ending in "proj"
  dim: 8                   # low-rank dimension r; controls adapter capacity.
                           # Larger values are more expressive but use more memory.
  alpha: 32                # LoRA scaling factor: adapter output is scaled by alpha/dim.
                           # Higher values give adapters more influence during training.
  use_triton: True         # use an optimized Triton kernel for LoRA forward/backward
                           # (requires the triton package)

# Checkpointing
# No _target_: plain key-value group consumed by the recipe.
checkpoint:
  enabled: true            # set to false to skip saving checkpoints entirely
  checkpoint_dir: checkpoints/  # output directory. Docker users: bind-mount this path
                                # (e.g. -v $(pwd)/checkpoints:/workspace/checkpoints)
                                # to persist checkpoints across container restarts.
  model_save_format: safetensors  # "safetensors" (recommended, faster and safer) or
                                  # "torch_save" (legacy pickle-based format)
  save_consolidated: final # recommended: export consolidated HF weights only for the final checkpoint.
                           # Other modes: false (sharded only) or every/true (export every checkpoint).

# Training Dataset
# _target_ → make_squad_dataset: a factory function that downloads the SQuAD
# dataset, tokenizes it, and returns a torch Dataset. To use a different dataset,
# change _target_ to another factory function (see the dataset guide).
dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad  # HuggingFace Hub dataset ID
  split: train                   # which split to use (train, validation, test)

# Validation Dataset
validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: validation
  limit_dataset_samples: 64  # cap validation set to 64 samples for faster eval loops;
                             # remove this line to use the full validation set

# Training Dataloader
# _target_ → StatefulDataLoader: a checkpointable DataLoader from torchdata that
# saves and restores iteration state across training restarts, so resumed runs
# don't re-process already-seen batches.
dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
                               # function that pads and batches individual samples
                               # into tensors; can be swapped for custom collation
  batch_size: 8                # samples per micro-batch per GPU
  shuffle: true                # whether to shuffle the dataset each epoch

# Validation Dataloader
validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
  batch_size: 8

# Loss Function
# _target_ → MaskedCrossEntropy: standard cross-entropy loss that automatically
# ignores padding tokens so they don't affect the gradient.
# Other available loss functions (swap _target_ to use):
#   - nemo_automodel.components.loss.chunked_ce.ChunkedCrossEntropy
#       Computes CE in chunks along the sequence dimension to reduce peak memory.
#       Useful for very long sequences. Accepts chunk_len (default 32).
#   - nemo_automodel.components.loss.linear_ce.FusedLinearCrossEntropy
#       Fuses the final linear projection (lm_head) with the CE computation,
132 #       avoiding the full logit tensor. Significant **memory savings** for large vocabs.
133 #   - nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy
134 #       TE-based parallel CE with a Triton kernel. Designed for tensor-parallel
135 #       setups where logits are sharded across TP ranks.
136 loss_fn:
137   _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy
138 
139 # Optimizer
140 # _target_ → torch.optim.Adam: any torch.optim class can be used here (e.g.
141 # AdamW, SGD). All remaining keys become kwargs to the constructor.
142 optimizer:
143   _target_: torch.optim.Adam
144   lr: 1.0e-5               # learning rate — the most important hyperparameter to tune
145   betas: [0.9, 0.999]      # Adam momentum coefficients (β₁ for mean, β₂ for variance)
146   eps: 1e-8                 # small constant added to the denominator for numerical stability
147   weight_decay: 0           # L2 regularization strength (0 = no regularization)
148 
149 # Logging (optional)
150 # Uncomment to enable Weights & Biases experiment tracking.
151 # wandb:
152 #   project: <your_wandb_project>    # W&B project name
153 #   entity: <your_wandb_entity>      # W&B team or username
154 #   name: <your_wandb_exp_name>      # display name for this run
155 #   save_dir: <your_wandb_save_dir>  # local directory for W&B artifacts
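
The `_target_` convention used throughout the config can be pictured with a small sketch. The helper below is hypothetical and is not NeMo AutoModel's own loader: it assumes only what the comments above state, namely that `_target_` is a dotted import path and that the remaining keys become keyword arguments. A standard-library class stands in for a real target such as `torch.optim.Adam` so the example runs without extra packages.

```python
import importlib
from typing import Any


def instantiate(section: dict[str, Any]) -> Any:
    """Resolve a config section with a `_target_` key into a Python object.

    Hypothetical helper for illustration only; not NeMo AutoModel's loader.
    """
    kwargs = dict(section)             # copy so the caller's dict is left untouched
    target = kwargs.pop("_target_")    # dotted path, e.g. "torch.optim.Adam"
    module_path, _, attr_name = target.rpartition(".")
    factory = getattr(importlib.import_module(module_path), attr_name)
    return factory(**kwargs)           # remaining keys become keyword arguments


# Standard-library stand-in for a section such as `optimizer:`.
section = {"_target_": "datetime.timedelta", "hours": 1, "minutes": 30}
delta = instantiate(section)
print(delta.total_seconds())  # 5400.0
```

A real recipe would also have to supply arguments that cannot live in YAML, such as the model parameters an optimizer needs; the sketch leaves that part out.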

Config Field Reference

Section	Required?	What to change
`model`	Yes	Set `pretrained_model_name_or_path` to your Hugging Face model ID. Source: `auto_model.py`.
`peft`	PEFT only	Remove entirely for SFT. Adjust `dim` and `alpha` to tune adapter capacity. `use_triton: True` enables an optimized LoRA kernel (requires the `triton` package). For reduced memory usage, see QLoRA. Source: `lora.py`.
`dataset`	Yes	Change `_target_`, `dataset_name`, and `split` for your data. Source: `squad.py`.
`dataloader`	Optional	Adjust `batch_size` and `shuffle`. Uses `StatefulDataLoader` for checkpointable iteration. Collation: `utils.py`.
`loss_fn`	Optional	Default is `MaskedCrossEntropy`. Alternatives: `ChunkedCrossEntropy` (long sequences), `FusedLinearCrossEntropy` (large vocabs), `TEParallelCrossEntropy` (tensor-parallel).
`rng`	Optional	Controls reproducibility. Source: `rng.py`.
`step_scheduler`	Yes	`grad_acc_steps` sets how many micro-batches accumulate per gradient step. `ckpt_every_steps` and `val_every_steps` are counted in gradient steps.
`distributed`	Yes	`dp_size: null` means auto-detect from world size. Adjust `tp_size` for tensor parallelism across GPUs.
`checkpoint`	Recommended	Set `checkpoint_dir` to a persistent path, especially in Docker.
`optimizer`	Optional	Defaults are reasonable. Any `torch.optim` class can be substituted via `_target_`. For long fine-tuning, especially full-parameter SFT, see the mixed-precision training guide before combining torch AdamW with bf16 resident parameters.
`wandb`	Optional	Uncomment and configure to enable Weights & Biases logging.

For the fine-tuning recipe itself, see train_ft.py. For more example configs, browse examples/llm_finetune/.
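
To make the `dim` and `alpha` settings of the `peft` section concrete, the sketch below works through the arithmetic for a single adapted layer. It assumes the standard LoRA layout (two low-rank matrices of shapes `dim × in_features` and `out_features × dim`) and uses a hypothetical 2048 × 2048 projection; the function names are illustrative and do not come from the library.

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Scaling applied to the adapter output: alpha / dim."""
    return alpha / dim


def lora_param_count(in_features: int, out_features: int, dim: int) -> int:
    """Trainable parameters added by one adapter: A is (dim x in), B is (out x dim)."""
    return dim * (in_features + out_features)


# Values from the example config: dim=8, alpha=32.
scale = lora_scale(alpha=32, dim=8)

# Hypothetical 2048 x 2048 projection layer (size chosen for illustration).
base_params = 2048 * 2048
adapter_params = lora_param_count(2048, 2048, dim=8)

print(scale)                         # 4.0
print(adapter_params)                # 32768
print(adapter_params / base_params)  # 0.0078125
```

For this layer the adapter adds under 1% of the frozen weight count, which is consistent with the overall reduction described in the introduction. Doubling `dim` doubles the adapter size and halves the scale unless `alpha` is raised with it.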

Distributed Training: TP, PP, CP, and EP

The distributed: section controls how the model and data are split across GPUs. NeMo AutoModel supports five parallelism dimensions: data parallelism (the default) plus TP, PP, CP, and EP. Each slices the workload differently:

Dimension	Key	What it shards	When to use
Data Parallel (DP)	`dp_size`	Replicates the model on each group of GPUs; each replica trains on a different data batch.	Default. Scales batch size linearly with GPU count.
Tensor Parallel (TP)	`tp_size`	Splits individual weight matrices (attention, MLP) across GPUs within a node.	Model is too large to fit on a single GPU, or you want to reduce per-GPU memory at the cost of extra communication.
Pipeline Parallel (PP)	`pp_size`	Assigns different layers (stages) to different GPUs and pipelines micro-batches through them.	Very deep models that don’t fit even with TP, or multi-node training where TP’s all-reduce is too expensive across nodes.
Context Parallel (CP)	`cp_size`	Splits the input sequence across GPUs so each GPU processes a portion of the context.	Very long sequences that exceed single-GPU memory.
Expert Parallel (EP)	`ep_size`	Distributes MoE experts across GPUs so each GPU holds a subset of experts.	Mixture-of-Experts models only.

These dimensions compose with each other. The relationship between them and total GPU count is:

world_size = pp_size × dp_size × cp_size × tp_size

When dp_size is set to null (the default), it is inferred automatically:

dp_size = world_size ÷ (tp_size × cp_size × pp_size)

EP does not appear in this formula — experts are distributed across the DP×CP rank groups, with the constraint that (dp_size × cp_size) must be divisible by ep_size.
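
The two formulas and the EP constraint can be checked by hand before launching a job. The function below is a minimal sketch of that arithmetic; its name and error messages are hypothetical and it does not call into NeMo AutoModel.

```python
from typing import Optional


def infer_dp_size(world_size: int, tp_size: int = 1, cp_size: int = 1,
                  pp_size: int = 1, ep_size: int = 1,
                  dp_size: Optional[int] = None) -> int:
    """Return dp_size, inferring it when it is None, and check the constraints."""
    model_parallel = tp_size * cp_size * pp_size
    if dp_size is None:
        if world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by "
                f"tp_size*cp_size*pp_size={model_parallel}")
        dp_size = world_size // model_parallel
    if pp_size * dp_size * cp_size * tp_size != world_size:
        raise ValueError("pp_size*dp_size*cp_size*tp_size must equal world_size")
    if (dp_size * cp_size) % ep_size != 0:
        raise ValueError(
            f"dp_size*cp_size={dp_size * cp_size} is not divisible by ep_size={ep_size}")
    return dp_size


print(infer_dp_size(world_size=8))                         # 8
print(infer_dp_size(world_size=8, tp_size=4))              # 2
print(infer_dp_size(world_size=16, cp_size=2, ep_size=4))  # 8
```

The last call uses a hypothetical 16-GPU job: dp_size is inferred as 8, so dp_size × cp_size = 16, which is divisible by ep_size = 4.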

Data Parallel (default)

Data parallelism is the default. With strategy: fsdp2, FSDP2 shards both model parameters and optimizer states across the DP group, so memory usage shrinks as you add GPUs:

distributed:
  strategy: fsdp2
  dp_size: null   # auto-detected from world_size ÷ (tp × cp × pp)
  tp_size: 1
  cp_size: 1

Tensor Parallelism

TP splits weight matrices across GPUs within a single node. Set tp_size to the number of GPUs you want to shard over (typically 2, 4, or 8). The model's number of attention heads should be evenly divisible by tp_size:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 4
  cp_size: 1
  sequence_parallel: false   # set to true for additional memory savings

sequence_parallel: true extends TP to also shard activation memory along the sequence dimension, further reducing per-GPU memory at the cost of additional communication.
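
A quick way to shortlist tp_size values is to filter by the head-count rule. The sketch below does that with a hypothetical helper. Requiring tp_size to also divide the number of GPUs per node is an added assumption, meant to keep each TP group inside one node, and the head counts are illustrative.

```python
def valid_tp_sizes(num_attention_heads: int, gpus_per_node: int) -> list[int]:
    """tp_size candidates that divide the head count and tile a single node."""
    return [
        tp for tp in range(1, gpus_per_node + 1)
        # Second condition is an assumption: TP groups should not span nodes.
        if num_attention_heads % tp == 0 and gpus_per_node % tp == 0
    ]


print(valid_tp_sizes(num_attention_heads=32, gpus_per_node=8))  # [1, 2, 4, 8]
print(valid_tp_sizes(num_attention_heads=12, gpus_per_node=8))  # [1, 2, 4]
```

Check the actual head count in your model's Hugging Face config before picking a value.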

Pipeline Parallelism

PP assigns groups of layers to different GPUs and streams micro-batches through the stages. It requires an additional nested pipeline: section:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 4
  pp_size: 4
  cp_size: 1
  activation_checkpointing: true

  pipeline:
    pp_schedule: interleaved1f1b  # pipeline schedule (1f1b or interleaved1f1b)
    pp_microbatch_size: 1         # micro-batch size per pipeline step
    layers_per_stage: 4           # how many layers each stage handles
    scale_grads_in_schedule: false

Key	Role
`pp_schedule`	The micro-batch schedule. `1f1b` is simpler; `interleaved1f1b` overlaps compute and communication for better throughput.
`pp_microbatch_size`	Number of samples per micro-batch fed into the pipeline. Must satisfy: `local_batch_size ÷ pp_microbatch_size ≥ pp_size`.
`layers_per_stage`	How many transformer layers each pipeline stage contains. If omitted, the framework splits layers evenly across `pp_size` stages.

PP requires the model to define a _pp_plan that tells the framework how to split layers into stages. All built-in models include this plan; custom models must add one.
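
The micro-batch constraint and the effect of `layers_per_stage` can be sanity-checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the helper is hypothetical, the stage count is estimated as the layer count divided by `layers_per_stage` (the framework's actual split may treat embeddings and the output head differently), and `num_layers=32` and `local_batch_size=8` are made-up values.

```python
import math
from typing import Optional


def pipeline_plan(num_layers: int, pp_size: int, local_batch_size: int,
                  pp_microbatch_size: int,
                  layers_per_stage: Optional[int] = None) -> dict:
    """Check the micro-batch constraint and estimate the stage count."""
    num_microbatches = local_batch_size // pp_microbatch_size
    if num_microbatches < pp_size:
        raise ValueError(
            f"{num_microbatches} micro-batches cannot fill {pp_size} pipeline stages")
    if layers_per_stage is None:
        # Assumed fallback: split the layers evenly across pp_size stages.
        layers_per_stage = math.ceil(num_layers / pp_size)
    num_stages = math.ceil(num_layers / layers_per_stage)
    return {"num_microbatches": num_microbatches, "num_stages": num_stages}


# pp_size=4, pp_microbatch_size=1, and layers_per_stage=4 come from the example
# config; num_layers=32 and local_batch_size=8 are hypothetical.
print(pipeline_plan(num_layers=32, pp_size=4, local_batch_size=8,
                    pp_microbatch_size=1, layers_per_stage=4))
# {'num_microbatches': 8, 'num_stages': 8}
```

Under these assumptions, 8 stages over pp_size = 4 would place two stages on each pipeline rank, which is the kind of layout an interleaved schedule works with.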

Context Parallelism

CP splits the sequence across GPUs — useful for very long contexts that exceed single-GPU memory. Set cp_size to the desired split factor:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 1
  cp_size: 2

When cp_size > 1, fused RoPE is automatically disabled. Some models also require the Transformer Engine (TE) attention backend for CP with packed sequences — the framework will raise an error with instructions if this applies.

Expert Parallelism (MoE models)

EP distributes MoE experts across GPUs. Set ep_size to the number of GPUs that share the full set of experts:

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  activation_checkpointing: true

EP only applies to Mixture-of-Experts models (e.g. Qwen3-MoE, Mixtral, DeepSeek-V3). For dense models, leave ep_size at 1 or omit it.
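
As a rough guide to what ep_size means per device, the sketch below computes how many experts each GPU would hold. The helper is hypothetical, the expert counts are illustrative, and the requirement that the expert count divide evenly by ep_size is an assumption rather than a documented rule.

```python
def experts_per_gpu(num_experts: int, ep_size: int) -> int:
    """Experts held by each GPU when experts are spread over ep_size GPUs."""
    if num_experts % ep_size != 0:
        # Assumed requirement: every GPU holds the same number of experts.
        raise ValueError(
            f"num_experts={num_experts} is not divisible by ep_size={ep_size}")
    return num_experts // ep_size


print(experts_per_gpu(num_experts=8, ep_size=8))    # 1
print(experts_per_gpu(num_experts=128, ep_size=8))  # 16
```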

Combine Multiple Dimensions

You can combine TP, PP, CP, and EP in a single config. For example, a large MoE model on a multi-node cluster might use:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 1
  cp_size: 2
  pp_size: 1
  ep_size: 4
  activation_checkpointing: true

When choosing a combination, keep these rules in mind:

world_size must be evenly divisible by pp_size × tp_size × cp_size (the quotient becomes dp_size).
(dp_size × cp_size) % ep_size == 0 — EP shares the DP×CP groups.
TP within a node, PP across nodes is the typical layout — TP requires fast NVLink bandwidth, while PP tolerates higher latency.
Start simple. Use DP-only first. Add TP if the model doesn’t fit on one GPU. Add PP for very large models. Add CP for long sequences. Add EP only for MoE architectures.

Next Steps

Integrate Your Own Text Dataset — swap the SQuAD example for your own data.
Recipes and End-to-End Examples — browse the full set of recipes available in NeMo AutoModel. See also the examples/llm_finetune/ directory for ready-to-run configs.
Dataset Overview: LLM, VLM, and Retrieval Datasets — see all supported dataset types across LLM, VLM, and retrieval tasks.
Knowledge Distillation — distill a fine-tuned model into a smaller one.