Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with NeMo AutoModel


Introduction

Pretrained language models are general-purpose: they know a lot about language but nothing about your particular domain, terminology, or task. Fine-tuning bridges that gap: you train the model on your own examples so it produces answers that are accurate and relevant for your use case, without the cost of training a model from scratch. The result is a model optimized for your data that you can evaluate, publish, and deploy. This guide walks you through that process end to end with NeMo AutoModel, from installation through training, evaluation, and deployment, using Meta Llama 3.2 1B and the SQuAD v1.1 dataset as a running example.

NeMo AutoModel supports two fine-tuning modes:

  • Supervised Fine-Tuning (SFT) updates all model parameters. Use SFT when you need maximum accuracy and have sufficient compute.
  • Parameter-Efficient Fine-Tuning (PEFT) using LoRA freezes the base model and trains small low-rank adapters. PEFT reduces trainable parameters to less than 1% of the original model, lowering memory and storage costs.

Workflow Overview

1. Install (pip install or Docker) ---> 2. Configure (YAML config; choose SFT or PEFT) ---> 3. Train (automodel CLI or torchrun) ---> 4. Inference (HF generate API) ---> 5. Evaluate (val loss + lm-eval-harness) ---> 6. Publish (optional; HF Hub upload) ---> 7. Deploy (optional; vLLM serving)

| Step | Section | SFT | PEFT |
| --- | --- | --- | --- |
| 1. Install | Install NeMo AutoModel | Same | Same |
| 2. Configure | Configure Your Training Recipe | YAML without peft: section | YAML with peft: section |
| 3. Train | Fine-Tune the Model | Same command for both modes | Same command for both modes |
| 4. Inference | Run Inference | Load consolidated checkpoint directly | Load base model + adapter |
| 5. Evaluate | Evaluate the Fine-Tuned Model | Validation loss during training; lm-eval-harness post-training | Same |
| 6. Publish | Publish to HF Hub | Upload model/consolidated/ | Upload model/ (adapter only) |
| 7. Deploy | Deploy with vLLM | vllm.LLM(model=...) | vLLMHFExporter with --lora-model |

Install NeMo AutoModel

$ pip3 install nemo-automodel

Alternatively, if you run into dependency or driver issues, use the pre-built Docker container:

$ docker pull nvcr.io/nvidia/nemo-automodel:26.06.00
$ docker run --gpus all -it --rm --shm-size=8g -v $(pwd)/checkpoints:/tmp/checkpoints/ nvcr.io/nvidia/nemo-automodel:26.06.00

Docker containers are ephemeral: files written inside the container are lost when it stops. The -v flag in the docker run command above bind-mounts a local checkpoints/ directory into the container so that saved checkpoints persist across runs. For more details, see Save Checkpoints When Using Docker.

For the full set of installation methods, see the installation guide.

Configure Your Training Recipe

Training is configured through a YAML config file with three required sections (model, dataset, and step_scheduler) plus an optional peft section. The sections below walk through each one. For the complete copy-pastable file, see Full Config YAML.

Under the hood, both SFT and PEFT are executed by a recipe: a self-contained Python class that wires together model loading, dataset preparation, training, checkpointing, and logging. The fine-tuning recipe is TrainFinetuneRecipeForNextTokenPrediction. The config file tells the recipe what to build; the recipe decides how to build it.

NeMo AutoModel configs use a convention borrowed from Hydra: the special _target_ key tells the framework which Python class or function to call, and every other key in the same YAML block is passed as a keyword argument to that call. For example:

optimizer:
  _target_: torch.optim.Adam
  lr: 1.0e-5
  weight_decay: 0

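is roughly equivalent to the following Python code. torch.optim.Adam also needs the parameters to optimize as its first argument; the YAML block does not list them, so this snippet assumes the recipe supplies the model's parameters at instantiation time:

from torch.optim import Adam

# `model` stands for the module built from the model section of the config.
optimizer = Adam(model.parameters(), lr=1.0e-5, weight_decay=0)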

The _target_ value is a dotted Python import path: the same string you would use in an import statement. The framework resolves it at runtime by importing the module and looking up the attribute. This means you can point _target_ at any class constructor or factory function, and the remaining keys become its arguments.

To discover which parameters a section accepts, look up the Python signature of its _target_. For instance, torch.optim.Adam accepts lr, betas, eps, and weight_decay (among others), so those are keys you can set in the YAML.
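One way to list the accepted keys is to resolve the dotted path and inspect the callable's signature. The sketch below is illustrative and not part of NeMo AutoModel: accepted_keys is a hypothetical helper, and it uses the standard-library json.JSONEncoder as a stand-in target so that it runs without PyTorch installed.

```python
import inspect
from pydoc import locate


def accepted_keys(target: str) -> list[str]:
    """Return the parameter names of the callable at a dotted import path."""
    obj = locate(target)  # imports the module, then walks the attribute chain
    if obj is None:
        raise ImportError(f"cannot resolve {target!r}")
    return list(inspect.signature(obj).parameters)


# Stand-in target; swap in "torch.optim.Adam" when PyTorch is installed.
print(accepted_keys("json.JSONEncoder"))
```

Any name printed here is a candidate YAML key for a section whose _target_ is that callable.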

From YAML to running code. Here is the path a config takes through the framework:

finetune_config.yaml
        |
        v
  +------------+    load_yaml_config() parses the file into
  | ConfigNode | <- a tree of ConfigNode objects, one per
  +------------+    YAML section.
        |
        v
  +------------+    The recipe's setup() method reads
  |   Recipe   | <- each section from the ConfigNode tree
  |  setup()   |    and passes it to the matching builder.
  +------------+
        |
        +----------------+------------------+------------------+
        v                v                  v                  v
   build_model     build_optimizer   build_dataloader   build_loss_module   ...
        |                |                  |                  |
        v                v                  v                  v
    cfg.model       cfg.optimizer      cfg.dataset        cfg.loss_fn
  .instantiate()   .instantiate()    .instantiate()     .instantiate()
        |                |                  |                  |
        v                v                  v                  v
    Resolves         Resolves           Resolves           Resolves
    _target_,        _target_,          _target_,          _target_,
    calls it         calls it           calls it           calls it
   with kwargs      with kwargs        with kwargs        with kwargs

Each builder function calls .instantiate() on its config section. .instantiate() does two things:

  1. Resolves _target_: imports the Python path and obtains the callable (class or function).
  2. Calls it: passes every other key in the section as a keyword argument.

Nested _target_ blocks (like collate_fn inside dataloader) are recursively instantiated the same way.
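The two steps, plus the recursion into nested blocks, can be sketched in a few lines of Python. This is a simplified illustration of the convention, not the framework's implementation: resolve_target and instantiate are hypothetical names, the config is a plain dict rather than a ConfigNode, and the targets are standard-library classes so the example runs without NeMo AutoModel or PyTorch.

```python
import importlib


def resolve_target(path: str):
    """Import the longest module prefix of `path`, then walk the remaining attributes."""
    parts = path.split(".")
    for i in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ModuleNotFoundError:
            continue  # this prefix is not a module; try a shorter one
        for name in parts[i:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"cannot resolve {path!r}")


def instantiate(section: dict, **extra):
    """Call the section's _target_ with every other key as a keyword argument."""
    kwargs = {}
    for key, value in section.items():
        if key == "_target_":
            continue
        # Nested blocks that carry their own _target_ are built first.
        if isinstance(value, dict) and "_target_" in value:
            value = instantiate(value)
        kwargs[key] = value
    return resolve_target(section["_target_"])(**kwargs, **extra)


# Stand-in config: the outer Fraction receives an already-built inner Fraction.
config = {
    "_target_": "fractions.Fraction",
    "numerator": {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2},
    "denominator": 2,
}
half_of_half = instantiate(config)
print(half_of_half)  # 1/4
```

The extra keyword arguments on instantiate show how a caller could supply values that are not in the YAML, such as model parameters for an optimizer.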

The recipe key. Every config file includes a top-level recipe key that tells the CLI which recipe class to run. You can write it as a short name or as a fully-qualified Python path; both resolve to the same class:

# Short name (the CLI looks up the class automatically)
recipe: TrainFinetuneRecipeForNextTokenPrediction

# Fully-qualified path (used as-is)
recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction

The short name form is a convenience: the CLI scans all recipe modules under nemo_automodel.recipes and matches the bare class name. If you invoke the recipe script directly with torchrun instead of the automodel CLI, the recipe key is not required because the script itself is the recipe.

Not every section uses _target_. Some sections like step_scheduler, distributed, and checkpoint are plain key-value groups consumed directly by the recipe. They control training schedule, parallelism strategy, and checkpoint behavior without instantiating a Python object.

Model

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

| Key | Role |
| --- | --- |
| _target_ | Points to NeMoAutoModelForCausalLM.from_pretrained, a factory method that downloads (or loads from cache) a pretrained Hugging Face model and wraps it with NeMo distributed-training support. |
| pretrained_model_name_or_path | A keyword argument to from_pretrained. Any argument that from_pretrained accepts can be added here (e.g., cache_dir, torch_dtype). |

This guide uses Meta Llama 3.2 1B as a running example. Replace pretrained_model_name_or_path with any supported Hugging Face model ID.

Llama is a family of decoder-only transformer models developed by Meta. The 1B variant is a compact model suitable for research and edge deployment, featuring RoPE positional embeddings, grouped-query attention (GQA), and SwiGLU activations.

Some Hugging Face models are gated. If the model page shows a "Request access" button:

  1. Log in with your Hugging Face account and accept the license.
  2. Ensure the token you use (from huggingface-cli login or HF_TOKEN) belongs to the approved account.

Pulling a gated model without an authorized token triggers a 403 error.

Dataset

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad  # HF-Hub ID used to pull the dataset
  split: train

validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: validation

| Key | Role |
| --- | --- |
| _target_ | Points to make_squad_dataset, a factory function that downloads the SQuAD dataset, tokenizes it, and returns a torch.utils.data.Dataset. To use a different dataset, change _target_ to a different factory function (see Integrate Your Own Text Dataset). |
| dataset_name, split | Keyword arguments passed to make_squad_dataset. Each dataset factory defines its own parameters; check the function signature to see what's available. |

This guide uses SQuAD v1.1 as a running example. Swap the dataset by changing _target_ and the dataset arguments; see Integrate Your Own Text Dataset and Dataset Overview: LLM, VLM, and Retrieval Datasets.

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and an answer span. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.

Example:

{
  "context": "Architecturally, the school has a Catholic character. ...",
  "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
  "answers": { "text": ["Saint Bernadette Soubirous"], "answer_start": [515] }
}
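To make the record structure concrete, the sketch below turns one SQuAD record into a prompt and a target answer. The template mirrors the Context / Question / Answer layout used in the inference examples later in this guide; it is an assumption for illustration, and the exact formatting applied inside make_squad_dataset may differ. format_squad_example is a hypothetical helper name.

```python
def format_squad_example(record: dict) -> tuple[str, str]:
    """Build a (prompt, answer) pair from a single SQuAD-style record."""
    prompt = (
        f"Context: {record['context']}\n\n"
        f"Question: {record['question']}\n\n"
        "Answer:"
    )
    # SQuAD stores answers as parallel lists; take the first reference answer.
    answer = record["answers"]["text"][0]
    return prompt, answer


record = {
    "context": "Architecturally, the school has a Catholic character. ...",
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]},
}
prompt, answer = format_squad_example(record)
print(prompt)
print(answer)
```

During fine-tuning the model learns to continue the prompt with the answer text, which is why the same layout is reused when prompting the fine-tuned model.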

PEFT (Optional)

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*.proj"  # glob pattern matching linear layer FQNs
  dim: 8                    # low-rank dimension of the adapters
  alpha: 32                 # scaling factor for learned weights

| Key | Role |
| --- | --- |
| _target_ | Points to PeftConfig, a dataclass that describes which layers to adapt and how. Unlike the model and dataset sections, this instantiation produces a config object, not the adapter itself. The recipe passes the resulting PeftConfig into build_model, which applies LoRA adapters to the model. |
| target_modules | A glob pattern matched against fully-qualified layer names (e.g. "*.proj" matches every layer whose name ends in proj). |
| dim | The low-rank dimension r, which controls adapter capacity. Larger values learn more but use more memory. |
| alpha | Scaling factor applied to the adapter output (alpha / dim). Higher values give adapters more influence during training. |

Including a peft: section enables LoRA fine-tuning. Remove it entirely to run SFT instead; see Switch Between SFT and PEFT.
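The effect of dim and alpha can be checked with a little arithmetic. The sketch below assumes the standard LoRA formulation, in which a frozen d_out x d_in weight gains two trainable matrices of shapes dim x d_in and d_out x dim, and the adapter output is scaled by alpha / dim. The 2048 x 2048 layer size is an illustrative choice, and the helper names are hypothetical.

```python
def lora_param_fraction(d_in: int, d_out: int, dim: int) -> float:
    """Trainable adapter parameters as a fraction of the frozen layer's weights."""
    base_params = d_in * d_out            # frozen weight matrix
    adapter_params = dim * (d_in + d_out)  # A is dim x d_in, B is d_out x dim
    return adapter_params / base_params


def lora_scale(alpha: float, dim: int) -> float:
    """Multiplier applied to the adapter output."""
    return alpha / dim


fraction = lora_param_fraction(d_in=2048, d_out=2048, dim=8)
print(f"adapter / base parameters: {fraction:.4%}")       # 0.7812%
print(f"adapter scaling factor: {lora_scale(32, 8)}")      # 4.0
```

For this layer size, dim: 8 keeps the adapter below 1% of the layer's weights, and doubling dim doubles that fraction.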

QLoRA (Quantized Low-Rank Adaptation)

If GPU memory is a constraint, QLoRA combines LoRA with 4-bit NormalFloat (NF4) quantization to reduce memory usage by up to 75% compared to full-parameter SFT in 16-bit precision, while maintaining comparable quality to standard LoRA.

To enable QLoRA, add a quantization: section alongside the peft: section in your config. Note two differences from the standard PEFT config above: target_modules uses the broader "*_proj" pattern to apply LoRA to all projection layers (wider coverage compensates for precision loss from 4-bit weights), and dim is increased from 8 to 16 for additional adapter capacity.

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*_proj"  # broader glob than "*.proj" to cover all projection layers
  dim: 16                   # LoRA rank (higher than default to offset quantization)
  alpha: 32                 # scaling factor
  dropout: 0.1              # LoRA dropout rate

quantization:
  load_in_4bit: True                # enable 4-bit quantization
  load_in_8bit: False               # use 4-bit, not 8-bit
  bnb_4bit_compute_dtype: bfloat16  # compute dtype
  bnb_4bit_use_double_quant: True   # double quantization for extra savings
  bnb_4bit_quant_type: nf4          # NormalFloat quantization type
  bnb_4bit_quant_storage: bfloat16  # storage dtype for quantized weights
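The "up to 75%" figure is consistent with simple storage arithmetic for the base weights: 4 bits per parameter instead of 16. The sketch below is a back-of-the-envelope estimate only. It ignores quantization constants, activations, adapter weights, and optimizer state, and the 1-billion parameter count is a round hypothetical number rather than the exact size of any model.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_param // 8


num_params = 1_000_000_000  # hypothetical round number for illustration
bf16_bytes = weight_bytes(num_params, 16)
nf4_bytes = weight_bytes(num_params, 4)
savings = 1 - nf4_bytes / bf16_bytes

print(f"16-bit weights: {bf16_bytes / 1e9:.1f} GB")  # 2.0 GB
print(f"4-bit weights:  {nf4_bytes / 1e9:.1f} GB")   # 0.5 GB
print(f"weight-storage savings: {savings:.0%}")      # 75%
```

Actual savings during training depend on sequence length, batch size, and the other memory consumers listed above, so measure on your own hardware.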

Training Schedule

step_scheduler:
  num_epochs: 1  # Will train over the dataset once.

Unlike the sections above, step_scheduler has no _target_ and is not instantiated into a Python object. Instead, the recipe reads its keys directly to control the training loop (how many epochs to run, when to checkpoint, when to validate). This is typical of sections that configure behavior rather than components.

All other settings (distributed strategy, optimizer, checkpointing, logging) use sensible defaults. See the Full Configuration Reference to customize them.

Most example recipes use bf16 training by default for memory and throughput. If you are running long fine-tuning, especially full-parameter SFT, and need higher-precision optimizer state, configure it explicitly instead of assuming it from the mixed-precision compute policy. See the mixed-precision training guide for the recommended TE and torch AdamW patterns.

Full Config YAML

Save as finetune_config.yaml. This config runs PEFT (LoRA). To run SFT instead, remove the peft: section. For production-ready examples, see the hosted configs: Llama 3.2 1B SFT and Llama 3.2 1B PEFT.

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  target_modules: "*.proj"
  dim: 8
  alpha: 32

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train

validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: validation

step_scheduler:
  num_epochs: 1

Fine-Tune the Model

You can run the recipe using the AutoModel CLI or directly with torchrun (advanced).

$ automodel --nproc-per-node=8 finetune_config.yaml

The --nproc-per-node=8 flag specifies the number of GPUs per node. Adjust as needed (for a single GPU, omit the --nproc-per-node option).

Invoke the Recipe Script Directly (Advanced)

Alternatively, you can invoke the recipe script directly using torchrun, as shown below.

$ torchrun --nproc-per-node=8 nemo_automodel/recipes/llm/train_ft.py -c finetune_config.yaml

Sample Output

Running the recipe with the automodel CLI or by invoking the recipe script directly produces a log similar to the following:

$ automodel finetune_config.yaml
INFO:nemo_automodel.cli.app:Config: finetune_config.yaml
INFO:nemo_automodel.cli.app:Recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
INFO:nemo_automodel.cli.app:Launching job interactively (local)
cfg-path: finetune_config.yaml
INFO:root:step 4 | epoch 0 | loss 1.5514 | grad_norm 102.0000 | mem: 11.66 GiB | tps 6924.50
INFO:root:step 8 | epoch 0 | loss 0.7913 | grad_norm 46.2500 | mem: 14.58 GiB | tps 9328.79
Saving checkpoint to checkpoints/epoch_0_step_10
INFO:root:step 12 | epoch 0 | loss 0.4358 | grad_norm 23.8750 | mem: 15.48 GiB | tps 9068.99
INFO:root:step 16 | epoch 0 | loss 0.2057 | grad_norm 12.9375 | mem: 16.47 GiB | tps 9148.28
INFO:root:step 20 | epoch 0 | loss 0.2557 | grad_norm 13.4375 | mem: 12.35 GiB | tps 9196.97
Saving checkpoint to checkpoints/epoch_0_step_20
INFO:root:[val] step 20 | epoch 0 | loss 0.2469

Each log line reports the current loss, gradient norm, peak GPU memory, and tokens per second (TPS). Small fluctuations between steps (e.g., 0.2057 to 0.2557 above) are normal; look at the overall downward trend rather than individual values.
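To look at the trend rather than individual values, the loss can be pulled out of the log with a small script. The sketch below assumes the line format shown in the sample output above; parse_losses is a hypothetical helper, not part of the automodel CLI.

```python
import re

# Matches both training lines and [val] lines from the sample log format.
LOG_LINE = re.compile(r"step (\d+) \| epoch (\d+) \| loss ([\d.]+)")


def parse_losses(log_text: str):
    """Return (train, val) lists of (step, loss) tuples found in the log text."""
    train, val = [], []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue  # checkpoint messages and other lines carry no loss
        point = (int(match.group(1)), float(match.group(3)))
        (val if "[val]" in line else train).append(point)
    return train, val


SAMPLE_LOG = """\
INFO:root:step 4 | epoch 0 | loss 1.5514 | grad_norm 102.0000 | mem: 11.66 GiB | tps 6924.50
INFO:root:step 8 | epoch 0 | loss 0.7913 | grad_norm 46.2500 | mem: 14.58 GiB | tps 9328.79
Saving checkpoint to checkpoints/epoch_0_step_10
INFO:root:step 12 | epoch 0 | loss 0.4358 | grad_norm 23.8750 | mem: 15.48 GiB | tps 9068.99
INFO:root:step 16 | epoch 0 | loss 0.2057 | grad_norm 12.9375 | mem: 16.47 GiB | tps 9148.28
INFO:root:step 20 | epoch 0 | loss 0.2557 | grad_norm 13.4375 | mem: 12.35 GiB | tps 9196.97
INFO:root:[val] step 20 | epoch 0 | loss 0.2469
"""

train, val = parse_losses(SAMPLE_LOG)
first_step, first_loss = train[0]
last_step, last_loss = train[-1]
print(f"train loss: {first_loss} at step {first_step} -> {last_loss} at step {last_step}")
print(f"validation points: {val}")
```

The resulting lists can be passed to any plotting library or compared across runs.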

Checkpoint Contents

Checkpoints are saved as Hugging Face-compatible safetensors.

By default, SFT writes sharded model weights plus a generated model/consolidate.sh helper; run the helper after training to create model/consolidated/ for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set save_consolidated: final to export consolidated HF weights only for the final checkpoint. Use save_consolidated: every (or legacy true) only when you intentionally want inline HF export at every checkpoint save.

PEFT checkpoints contain only the adapter weights (megabytes instead of gigabytes) and are saved directly under model/. They do not use model/consolidate.sh; at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).

SFT checkpoint:

$ tree checkpoints/epoch_0_step_10/
checkpoints/epoch_0_step_10/
├── config.yaml
├── dataloader.pt
├── model
│   ├── consolidate.sh
│   ├── shard-00001-model-00001-of-00001.safetensors
│   └── shard-00002-model-00001-of-00001.safetensors
├── optim
│   ├── __0_0.distcp
│   └── __1_0.distcp
├── rng.pt
└── step_scheduler.pt

2 directories, 9 files

PEFT checkpoint:

$ tree checkpoints/epoch_0_step_10/
checkpoints/epoch_0_step_10/
├── dataloader.pt
├── config.yaml
├── model
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── automodel_peft_config.json
├── optim
│   ├── __0_0.distcp
│   └── __1_0.distcp
├── rng.pt
└── step_scheduler.pt

2 directories, 9 files
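Because downstream steps depend on the checkpoint type, it can help to detect it programmatically. The sketch below relies only on the directory layouts shown above (the epoch_N_step_M naming, adapter_config.json for PEFT, and model/consolidated/ for exported SFT weights). latest_checkpoint and checkpoint_kind are hypothetical helpers, not NeMo AutoModel APIs.

```python
import re
from pathlib import Path

CKPT_NAME = re.compile(r"epoch_(\d+)_step_(\d+)$")


def latest_checkpoint(root):
    """Return the checkpoint directory with the highest step, or None if there is none."""
    candidates = []
    for path in Path(root).iterdir():
        match = CKPT_NAME.match(path.name)
        if path.is_dir() and match:
            epoch, step = int(match.group(1)), int(match.group(2))
            candidates.append((step, epoch, path))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[:2])[2]


def checkpoint_kind(ckpt):
    """Classify a checkpoint directory by the files under its model/ folder."""
    model_dir = Path(ckpt) / "model"
    if (model_dir / "adapter_config.json").exists():
        return "peft"
    if (model_dir / "consolidated").is_dir():
        return "sft (consolidated)"
    return "sft (sharded)"


root = Path("checkpoints")
if root.is_dir():  # only runs when a checkpoints/ directory is present
    ckpt = latest_checkpoint(root)
    if ckpt is not None:
        print(ckpt, "->", checkpoint_kind(ckpt))
```

A result of "sft (sharded)" indicates that model/consolidate.sh still needs to be run before the checkpoint can be loaded by Hugging Face tools.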

Run Inference

Inference uses the Hugging Face generate API. Because exported SFT checkpoints are self-contained while PEFT checkpoints store only adapter weights (see Checkpoint Contents), the loading procedure differs between the two modes.

SFT Inference

If the run used save_consolidated: false, first run the generated helper for the checkpoint you want to load:

$ bash checkpoints/epoch_0_step_10/model/consolidate.sh

The exported SFT checkpoint at model/consolidated/ is a complete Hugging Face model and can be loaded directly:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_path = "checkpoints/epoch_0_step_10/model/consolidated"
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
model = AutoModelForCausalLM.from_pretrained(ckpt_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = (
    "Context: Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue of Christ "
    "with arms upraised with the legend 'Venite Ad Me Omnes'.\n\n"
    "Question: What is atop the Main Building?\n\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

PEFT Inference

PEFT adapters must be loaded on top of the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original base model first, then apply the adapter on top.
base_model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

adapter_path = "checkpoints/epoch_0_step_10/model/"
model = PeftModel.from_pretrained(model, adapter_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = (
    "Context: Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue of Christ "
    "with arms upraised with the legend 'Venite Ad Me Omnes'.\n\n"
    "Question: What is atop the Main Building?\n\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Evaluate the Fine-Tuned Model

During Training: Validation Loss

The recipe automatically computes validation loss at the interval set by val_every_steps. Look for [val] lines in the training log:

INFO:root:[val] step 20 | epoch 0 | loss 0.2469

A decreasing validation loss across checkpoints indicates the model is learning. If validation loss plateaus or increases while training loss continues to drop, the model may be overfitting; consider stopping earlier or reducing the learning rate.
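That rule of thumb can be written down as a simple check over the recorded losses. The sketch below is one possible heuristic, not something the recipe does for you: looks_overfit and its patience parameter are hypothetical, and the loss values are made-up numbers for illustration.

```python
def looks_overfit(train_losses, val_losses, patience=2):
    """True if validation loss rose for `patience` consecutive evaluations
    while training loss fell overall."""
    if len(val_losses) <= patience or len(train_losses) < 2:
        return False  # not enough history to judge
    recent = val_losses[-(patience + 1):]
    val_rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    train_falling = train_losses[-1] < train_losses[0]
    return val_rising and train_falling


# Hypothetical histories: training keeps improving in both cases.
healthy = looks_overfit([1.0, 0.5, 0.2], [0.50, 0.45, 0.40])
suspect = looks_overfit([1.0, 0.5, 0.2], [0.40, 0.45, 0.50])
print(healthy, suspect)  # False True
```

A larger patience value makes the check less sensitive to the normal step-to-step noise in validation loss.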

After Training: lm-eval-harness

For task-specific benchmarks (e.g., MMLU, GSM8K, HellaSwag accuracy), use lm-eval-harness with the fine-tuned checkpoint. For SFT runs with save_consolidated: false, run bash checkpoints/epoch_0_step_20/model/consolidate.sh before pointing evaluation at model/consolidated/:

$ pip install lm-eval

# SFT checkpoint (using vLLM backend for faster evaluation)
$ lm_eval --model vllm \
    --model_args pretrained=checkpoints/epoch_0_step_20/model/consolidated/ \
    --tasks hellaswag \
    --batch_size auto

# PEFT adapter (using Hugging Face backend with built-in PEFT support)
$ lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-1B,peft=checkpoints/epoch_0_step_20/model/ \
    --tasks hellaswag \
    --batch_size auto

The SFT example uses the vllm backend for faster evaluation (requires pip install vllm; see Deploy with vLLM for setup details). The PEFT example uses the hf backend with lm-eval's built-in PEFT support to load the adapter on top of the base model.

Run lm-eval-harness on the base model before fine-tuning to establish a baseline, then compare against the fine-tuned checkpoint.

Publish to the Hugging Face Hub

Fine-tuned checkpoints and PEFT adapters are stored in Hugging Face-native format and can be uploaded directly to the Hub. For SFT runs with save_consolidated: false, upload model/consolidated/ after running the generated consolidation helper.

  1. Install the Hugging Face Hub library (if not already installed):
$ pip3 install huggingface_hub
  2. Log in to Hugging Face:
$ huggingface-cli login
  3. Upload:

SFT checkpoint:

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="checkpoints/epoch_0_step_10/model/consolidated",
    repo_id="your-username/llama3.2_1b-finetuned-squad",
    repo_type="model",
)

PEFT adapter:

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="checkpoints/epoch_0_step_10/model",
    repo_id="your-username/llama3.2_1b-lora-squad",
    repo_type="model",
)

Once uploaded, load the checkpoint or adapter directly from the Hub:

SFT:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-username/llama3.2_1b-finetuned-squad")

PEFT:

from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model = PeftModel.from_pretrained(model, "your-username/llama3.2_1b-lora-squad")

Deploy with vLLM

vLLM is an efficient inference engine for production deployment of LLMs.

Make sure vLLM is installed (pip install vllm, or use an environment that includes it).

SFT Checkpoint with vLLM

If the run used save_consolidated: false, run the generated model/consolidate.sh helper before serving from model/consolidated/.

from vllm import LLM, SamplingParams

llm = LLM(model="checkpoints/epoch_0_step_10/model/consolidated/", model_impl="transformers")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("Toronto is a city in Canada.", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")
>>> Generated text: It is the capital of Ontario. Toronto is a global hub for cultural tourism. The City of Toronto

PEFT Adapter with vLLM

PEFT adapter serving uses the vLLMHFExporter class, which is provided by the nemo package, a separate dependency from nemo-automodel.

Install both packages before proceeding:

$ pip install nemo vllm

import argparse

from nemo.export.vllm_hf_exporter import vLLMHFExporter

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', required=True, type=str, help="Local path of the base model")
    parser.add_argument('--lora-model', required=True, type=str, help="Local path of the LoRA adapter")
    args = parser.parse_args()

    lora_model_name = "lora_model"

    # Export the base model with LoRA support enabled, then register the adapter.
    exporter = vLLMHFExporter()
    exporter.export(model=args.model, enable_lora=True)
    exporter.add_lora_models(lora_model_name=lora_model_name, lora_model=args.lora_model)

    print("vLLM Output: ", exporter.forward(input_texts=["How are you doing?"], lora_model_name=lora_model_name))

Full Configuration Reference

This section documents all available config fields for the fine-tuning recipe. For the quick-start config, see Configure Your Training Recipe.

Switch Between SFT and PEFT

The peft: section controls which mode runs:

| Mode | What to do in the YAML |
| --- | --- |
| PEFT (LoRA) | Include the peft: section as shown below. |
| SFT (full-parameter) | Remove/comment the peft: section entirely. |

All other config sections remain the same for both modes.

Full Configuration

1# Recipe
2# Selects which recipe class runs the training loop.
3# Use a short name (auto-discovered) or a fully-qualified Python path:
4# recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
5recipe: TrainFinetuneRecipeForNextTokenPrediction
6
7# Training Schedule
8# Controls epoch count, batch sizes, and how often to checkpoint / validate.
9# No _target_ β€” these are plain values read directly by the recipe.
10step_scheduler:
11 grad_acc_steps: 4 # number of micro-batches accumulated before each optimizer
12 # step. Effective batch = grad_acc_steps Γ— batch_size.
13 ckpt_every_steps: 10 # save a checkpoint every N gradient steps
14 val_every_steps: 10 # run the validation loop every N gradient steps
15 num_epochs: 1 # how many full passes over the training dataset
16
17# Process Group
18# Initializes the PyTorch distributed process group.
19# No _target_ β€” consumed directly by the recipe.
20# You normally would not need to tune this.
21dist_env:
22 backend: nccl # communication backend: "nccl" (GPU, recommended) or "gloo" (CPU)
23 timeout_minutes: 1 # timeout for collective operations; increase for large models
24 # that take longer to initialize
25
26# Distributed Strategy
27# Determines how model weights, data, and compute are split across GPUs.
28# No _target_ β€” consumed directly by the recipe.
29# See "Distributed Training: TP, PP, CP, and EP" in Advanced Topics for details.
30distributed:
31 strategy: fsdp2 # parallelism strategy: "fsdp2" (recommended), "megatron_fsdp",
32 # or "ddp". FSDP2 shards parameters and optimizer states across
33 # the data-parallel group.
34 dp_size: null # data-parallel group size. null = auto-detect from
35 # world_size Γ· (tp_size Γ— cp_size Γ— pp_size).
36 tp_size: 1 # tensor-parallel size: splits weight matrices across GPUs.
37 # Set to 2, 4, or 8 if the model doesn't fit on one GPU.
38 # Should divide evenly into the number of attention heads.
39 cp_size: 1 # context-parallel size: splits the input sequence across GPUs.
40 # Increase for very long contexts (e.g. 32k+ tokens).
41 sequence_parallel: false # when true, extends TP to also shard activations along
42 # the sequence dimension for additional memory savings
43
44# Random Number Generator
45# _target_ β†’ StatefulRNG: a checkpointable RNG that ensures identical sequences
46# across training restarts. Seed and ranked are kwargs to StatefulRNG().
47rng:
48 _target_: nemo_automodel.components.training.rng.StatefulRNG
49 seed: 1111 # global random seed for reproducibility
50 ranked: true # when true, each GPU rank gets a unique RNG stream derived
51 # from the seed, so data shuffling differs per GPU
52
53# Model
54# _target_ β†’ NeMoAutoModelForCausalLM.from_pretrained: downloads (or loads from
55# cache) a pretrained HuggingFace model and wraps it with NeMo distributed-training
56# support. Any from_pretrained kwarg is accepted (cache_dir, torch_dtype, etc.).
57model:
58 _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
59 pretrained_model_name_or_path: meta-llama/Llama-3.2-1B
60
61# PEFT (remove / comment this entire section for full-parameter SFT)
62# _target_ β†’ PeftConfig: a dataclass describing which layers get LoRA adapters.
63# The recipe passes this config into build_model(), which attaches adapters
64# to the matching layers.
65peft:
66 _target_: nemo_automodel.components._peft.lora.PeftConfig
67 target_modules: "*.proj" # glob pattern matched against fully-qualified layer names;
68 # "*.proj" matches every layer ending in "proj"
69 dim: 8 # low-rank dimension r β€” controls adapter capacity.
70 # Larger values are more expressive but use more memory.
71 alpha: 32 # LoRA scaling factor: adapter output is scaled by alpha/dim.
72 # Higher values give adapters more influence during training.
73 use_triton: True # use an optimized Triton kernel for LoRA forward/backward
74 # (requires the triton package)
75
76# Checkpointing
77# No _target_ β€” plain key-value group consumed by the recipe.
78checkpoint:
  enabled: true                    # set to false to skip saving checkpoints entirely
  checkpoint_dir: checkpoints/     # output directory. Docker users: bind-mount this path
                                   # (e.g. -v $(pwd)/checkpoints:/workspace/checkpoints)
                                   # to persist checkpoints across container restarts.
  model_save_format: safetensors   # "safetensors" (recommended, faster and safer) or
                                   # "torch_save" (legacy pickle-based format)
  save_consolidated: final         # recommended: export consolidated HF weights only for the final checkpoint.
                                   # Other modes: false (sharded only) or every/true (export every checkpoint).

# Training Dataset
# _target_ → make_squad_dataset: a factory function that downloads the SQuAD
# dataset, tokenizes it, and returns a torch Dataset. To use a different dataset,
# change _target_ to another factory function (see the dataset guide).
dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad    # HuggingFace Hub dataset ID
  split: train                     # which split to use (train, validation, test)

# Validation Dataset
validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: validation
  limit_dataset_samples: 64        # cap validation set to 64 samples for faster eval loops;
                                   # remove this line to use the full validation set

# Training Dataloader
# _target_ → StatefulDataLoader: a checkpointable DataLoader from torchdata that
# saves and restores iteration state across training restarts, so resumed runs
# don't re-process already-seen batches.
dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
                                   # function that pads and batches individual samples
                                   # into tensors; can be swapped for custom collation
  batch_size: 8                    # samples per micro-batch per GPU
  shuffle: true                    # whether to shuffle the dataset each epoch

# Validation Dataloader
validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
  batch_size: 8

# Loss Function
# _target_ → MaskedCrossEntropy: standard cross-entropy loss that automatically
# ignores padding tokens so they don't affect the gradient.
# Other available loss functions (swap _target_ to use):
# - nemo_automodel.components.loss.chunked_ce.ChunkedCrossEntropy
#   Computes CE in chunks along the sequence dimension to reduce peak memory.
#   Useful for very long sequences. Accepts chunk_len (default 32).
# - nemo_automodel.components.loss.linear_ce.FusedLinearCrossEntropy
#   Fuses the final linear projection (lm_head) with the CE computation,
#   avoiding the full logit tensor. Significant memory savings for large vocabs.
# - nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy
#   TE-based parallel CE with a Triton kernel. Designed for tensor-parallel
#   setups where logits are sharded across TP ranks.
loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

# Optimizer
# _target_ → torch.optim.Adam: any torch.optim class can be used here (e.g.
# AdamW, SGD). All remaining keys become kwargs to the constructor.
optimizer:
  _target_: torch.optim.Adam
  lr: 1.0e-5                       # learning rate: the most important hyperparameter to tune
  betas: [0.9, 0.999]              # Adam momentum coefficients (β₁ for mean, β₂ for variance)
  eps: 1e-8                        # small constant added to the denominator for numerical stability
  weight_decay: 0                  # L2 regularization strength (0 = no regularization)

# Logging (optional)
# Uncomment to enable Weights & Biases experiment tracking.
# wandb:
#   project: <your_wandb_project>    # W&B project name
#   entity: <your_wandb_entity>      # W&B team or username
#   name: <your_wandb_exp_name>      # display name for this run
#   save_dir: <your_wandb_save_dir>  # local directory for W&B artifacts
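
The comments in the config describe a pattern in which `_target_` names a class or factory function and the remaining keys are passed to it as keyword arguments. The sketch below illustrates that pattern only; it is not NeMo AutoModel's actual loader. `instantiate` is a hypothetical helper, and a standard-library class stands in for `torch.optim.Adam` so the example runs without PyTorch.

```python
import importlib


def instantiate(node: dict):
    """Resolve the dotted `_target_` path and call it with the other keys as kwargs."""
    kwargs = {key: value for key, value in node.items() if key != "_target_"}
    module_path, _, attr_name = node["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Hypothetical config node: fractions.Fraction plays the role of the optimizer class.
node = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
print(instantiate(node))  # 3/4
```

Under this reading, swapping `_target_` for another importable path changes which object is built, while the sibling keys must match that callable's parameters.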

Config Field Reference

| Section | Required? | What to change |
| --- | --- | --- |
| model | Yes | Set pretrained_model_name_or_path to your Hugging Face model ID. Source: auto_model.py. |
| peft | PEFT only | Remove entirely for SFT. Adjust dim and alpha to tune adapter capacity. use_triton: True enables an optimized LoRA kernel (requires the triton package). For reduced memory usage, see QLoRA. Source: lora.py. |
| dataset | Yes | Change _target_, dataset_name, and split for your data. Source: squad.py. |
| dataloader | Optional | Adjust batch_size and shuffle. Uses StatefulDataLoader for checkpointable iteration. Collation: utils.py. |
| loss_fn | Optional | Default is MaskedCrossEntropy. Alternatives: ChunkedCrossEntropy (long sequences), FusedLinearCrossEntropy (large vocabs), TEParallelCrossEntropy (tensor-parallel). |
| rng | Optional | Controls reproducibility. Source: rng.py. |
| step_scheduler | Yes | grad_acc_steps sets how many micro-batches accumulate per gradient step. ckpt_every_steps and val_every_steps are counted in gradient steps. |
| distributed | Yes | dp_size: null means auto-detect from world size. Adjust tp_size for tensor parallelism across GPUs. |
| checkpoint | Recommended | Set checkpoint_dir to a persistent path, especially in Docker. |
| optimizer | Optional | Defaults are reasonable. Any torch.optim class can be substituted via _target_. For long fine-tuning, especially full-parameter SFT, see the mixed-precision training guide before combining torch AdamW with bf16 resident parameters. |
| wandb | Optional | Uncomment and configure to enable Weights & Biases logging. |

For the fine-tuning recipe itself, see train_ft.py. For more example configs, browse examples/llm_finetune/.
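
The dataloader and step_scheduler rows interact: `batch_size` is per micro-batch per GPU, `grad_acc_steps` micro-batches are accumulated per gradient step, and each data-parallel replica sees different data. The sketch below works through that arithmetic using the usual accumulation formula, which is assumed here rather than taken from the framework's code. Both function names are hypothetical, as are all example values except `batch_size=8`.

```python
def samples_per_optimizer_step(batch_size: int, grad_acc_steps: int, dp_size: int) -> int:
    """Samples consumed across all data-parallel replicas per optimizer update."""
    return batch_size * grad_acc_steps * dp_size


def micro_batches_between_events(every_steps: int, grad_acc_steps: int) -> int:
    """Micro-batches each rank processes between two checkpoints or validation runs."""
    return every_steps * grad_acc_steps


# batch_size=8 matches the dataloader config above; grad_acc_steps=4, dp_size=8,
# and ckpt_every_steps=100 are hypothetical values chosen for illustration.
print(samples_per_optimizer_step(batch_size=8, grad_acc_steps=4, dp_size=8))  # 256
print(micro_batches_between_events(every_steps=100, grad_acc_steps=4))        # 400
```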

Distributed Training: TP, PP, CP, and EP

The distributed: section controls how the model and data are split across GPUs. NeMo AutoModel supports five parallelism dimensions, each of which slices the workload differently:

| Dimension | Key | What it shards | When to use |
| --- | --- | --- | --- |
| Data Parallel (DP) | dp_size | Replicates the model on each group of GPUs; each replica trains on a different data batch. | Default. Scales batch size linearly with GPU count. |
| Tensor Parallel (TP) | tp_size | Splits individual weight matrices (attention, MLP) across GPUs within a node. | Model is too large to fit on a single GPU, or you want to reduce per-GPU memory at the cost of extra communication. |
| Pipeline Parallel (PP) | pp_size | Assigns different layers (stages) to different GPUs and pipelines micro-batches through them. | Very deep models that don't fit even with TP, or multi-node training where TP's all-reduce is too expensive across nodes. |
| Context Parallel (CP) | cp_size | Splits the input sequence across GPUs so each GPU processes a portion of the context. | Very long sequences that exceed single-GPU memory. |
| Expert Parallel (EP) | ep_size | Distributes MoE experts across GPUs so each GPU holds a subset of experts. | Mixture-of-Experts models only. |

These dimensions compose with each other. The relationship between them and total GPU count is:

world_size = pp_size × dp_size × cp_size × tp_size

When dp_size is set to null (the default), it is inferred automatically:

dp_size = world_size ÷ (tp_size × cp_size × pp_size)

EP does not appear in this formula: experts are distributed across the DP×CP rank groups, with the constraint that (dp_size × cp_size) must be divisible by ep_size.
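
The two formulas and the EP constraint can be checked before launching a job. The following is a minimal sketch of that arithmetic; `infer_dp_size` and `ep_constraint_holds` are hypothetical helper names, not functions from NeMo AutoModel, and the GPU counts are illustrative.

```python
def infer_dp_size(world_size: int, tp_size: int = 1, cp_size: int = 1, pp_size: int = 1) -> int:
    """Return dp_size = world_size / (tp_size * cp_size * pp_size), or raise if not exact."""
    model_parallel = tp_size * cp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*cp*pp={model_parallel}"
        )
    return world_size // model_parallel


def ep_constraint_holds(dp_size: int, cp_size: int, ep_size: int) -> bool:
    """Check that (dp_size * cp_size) is divisible by ep_size."""
    return (dp_size * cp_size) % ep_size == 0


# Hypothetical 16-GPU job with tp_size=4 and pp_size=2.
print(infer_dp_size(world_size=16, tp_size=4, pp_size=2))      # 2
# Hypothetical MoE layout: dp_size=8, cp_size=2, ep_size=4.
print(ep_constraint_holds(dp_size=8, cp_size=2, ep_size=4))    # True
```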

Data Parallel (default)

Data parallelism is the default. With strategy: fsdp2, FSDP2 shards both model parameters and optimizer states across the DP group, so memory usage shrinks as you add GPUs:

distributed:
  strategy: fsdp2
  dp_size: null   # auto-detected from world_size ÷ (tp × cp × pp)
  tp_size: 1
  cp_size: 1

Tensor Parallelism

TP splits weight matrices across GPUs within a single node. Set tp_size to the number of GPUs you want to shard over (typically 2, 4, or 8); the number of attention heads should be evenly divisible by tp_size:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 4
  cp_size: 1
  sequence_parallel: false   # set to true for additional memory savings

sequence_parallel: true extends TP to also shard activation memory along the sequence dimension, further reducing per-GPU memory at the cost of additional communication.
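
To choose a tp_size, it can help to list the values that satisfy the attention-head rule. This is a minimal sketch: `valid_tp_sizes` is a hypothetical helper, the head counts are illustrative, and requiring tp_size to divide the GPUs per node is an added assumption based on TP staying within a single node.

```python
def valid_tp_sizes(num_attention_heads: int, gpus_per_node: int) -> list[int]:
    """List tp_size values that evenly divide both the head count and the node's GPU count."""
    return [
        tp
        for tp in range(1, gpus_per_node + 1)
        if num_attention_heads % tp == 0 and gpus_per_node % tp == 0
    ]


# Hypothetical model shapes on an 8-GPU node.
print(valid_tp_sizes(num_attention_heads=32, gpus_per_node=8))  # [1, 2, 4, 8]
print(valid_tp_sizes(num_attention_heads=12, gpus_per_node=8))  # [1, 2, 4]
```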

Pipeline Parallelism

PP assigns groups of layers to different GPUs and streams micro-batches through the stages. It requires an additional nested pipeline: section:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 4
  pp_size: 4
  cp_size: 1
  activation_checkpointing: true

  pipeline:
    pp_schedule: interleaved1f1b    # pipeline schedule (1f1b or interleaved1f1b)
    pp_microbatch_size: 1           # micro-batch size per pipeline step
    layers_per_stage: 4             # how many layers each stage handles
    scale_grads_in_schedule: false

| Key | Role |
| --- | --- |
| pp_schedule | The micro-batch schedule. 1f1b is simpler; interleaved1f1b overlaps compute and communication for better throughput. |
| pp_microbatch_size | Number of samples per micro-batch fed into the pipeline. Must satisfy: local_batch_size ÷ pp_microbatch_size ≥ pp_size. |
| layers_per_stage | How many transformer layers each pipeline stage contains. If omitted, the framework splits layers evenly across pp_size stages. |

PP requires the model to define a _pp_plan that tells the framework how to split layers into stages. All built-in models include this plan; custom models must add one.
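
The pp_microbatch_size constraint is easy to violate when shrinking the batch size, so a quick pre-flight check can be useful. This is a minimal sketch: `pipeline_microbatches` is a hypothetical helper, the integer division is an assumption about how the ratio is counted, and the values are illustrative.

```python
def pipeline_microbatches(local_batch_size: int, pp_microbatch_size: int, pp_size: int) -> int:
    """Return the micro-batch count per step, or raise if it is smaller than pp_size."""
    n_microbatches = local_batch_size // pp_microbatch_size
    if n_microbatches < pp_size:
        raise ValueError(
            f"{n_microbatches} micro-batches cannot fill {pp_size} pipeline stages; "
            "raise local_batch_size or lower pp_microbatch_size"
        )
    return n_microbatches


# Hypothetical setting: 8 samples per rank, micro-batches of 1, 4 pipeline stages.
print(pipeline_microbatches(local_batch_size=8, pp_microbatch_size=1, pp_size=4))  # 8
```

With the same batch of 8 and pp_microbatch_size=4, only two micro-batches would be available for four stages, so the check raises.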

Context Parallelism

CP splits the sequence across GPUs, which is useful for very long contexts that exceed single-GPU memory. Set cp_size to the desired split factor:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 1
  cp_size: 2

When cp_size > 1, fused RoPE is automatically disabled. Some models also require the Transformer Engine (TE) attention backend for CP with packed sequences; the framework will raise an error with instructions if this applies.
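
To make the effect of cp_size concrete, the sketch below splits one token sequence into equal contiguous chunks, one per CP rank. It is a simplified illustration, not the framework's implementation: real context-parallel code may pad the sequence or use a different, load-balanced layout, and `split_for_context_parallel` is a hypothetical helper.

```python
def split_for_context_parallel(token_ids: list[int], cp_size: int) -> list[list[int]]:
    """Split a sequence into cp_size equal contiguous chunks (one per CP rank)."""
    if len(token_ids) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size in this sketch")
    chunk_len = len(token_ids) // cp_size
    return [token_ids[rank * chunk_len:(rank + 1) * chunk_len] for rank in range(cp_size)]


# Hypothetical 8-token sequence with cp_size=2: each rank holds 4 tokens.
print(split_for_context_parallel(list(range(8)), cp_size=2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```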

Expert Parallelism (MoE models)

EP distributes MoE experts across GPUs. Set ep_size to the number of GPUs that share the full set of experts:

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  activation_checkpointing: true

EP only applies to Mixture-of-Experts models (e.g. Qwen3-MoE, Mixtral, DeepSeek-V3). For dense models, leave ep_size at 1 or omit it.
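
As an illustration of "each GPU holds a subset of experts", the sketch below assigns experts to EP ranks in contiguous blocks. The contiguous layout and the requirement that the expert count be divisible by ep_size are assumptions made for this example; `experts_on_rank` is a hypothetical helper, and the actual placement is decided by the framework.

```python
def experts_on_rank(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Return the expert indices held by one EP rank under a contiguous block layout."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size in this sketch")
    per_rank = num_experts // ep_size
    return list(range(ep_rank * per_rank, (ep_rank + 1) * per_rank))


# Hypothetical MoE layer with 16 experts and ep_size=8: two experts per rank.
print(experts_on_rank(num_experts=16, ep_size=8, ep_rank=3))  # [6, 7]
```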

Combine Multiple Dimensions

You can combine TP, PP, CP, and EP in a single config. For example, a large MoE model on a multi-node cluster might use:

distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 1
  cp_size: 2
  pp_size: 1
  ep_size: 4
  activation_checkpointing: true

When choosing a combination, keep these rules in mind:

  • world_size must be divisible by pp_size × tp_size × cp_size; the quotient becomes dp_size.
  • (dp_size × cp_size) % ep_size == 0, because EP shares the DP×CP groups.
  • TP within a node, PP across nodes is the typical layout: TP requires fast NVLink bandwidth, while PP tolerates higher latency.
  • Start simple. Use DP-only first. Add TP if the model doesn't fit on one GPU. Add PP for very large models. Add CP for long sequences. Add EP only for MoE architectures.

Next Steps