Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with NeMo AutoModel
Introduction
Pretrained language models are general-purpose: they know a lot about language but nothing about your particular domain, terminology, or task. Fine-tuning bridges that gap: you train the model on your own examples so it produces answers that are accurate and relevant for your use case, without the cost of training a model from scratch. The result is a model optimized for your data that you can evaluate, publish, and deploy. This guide walks you through that process end-to-end with NeMo AutoModel, from installation through training, evaluation, and deployment, using Meta Llama 3.2 1B and the SQuAD v1.1 dataset as a running example.
NeMo AutoModel supports two fine-tuning modes:
- Supervised Fine-Tuning (SFT) updates all model parameters. Use SFT when you need maximum accuracy and have sufficient compute.
- Parameter-Efficient Fine-Tuning (PEFT) using LoRA freezes the base model and trains small low-rank adapters. PEFT reduces trainable parameters to less than 1% of the original model, lowering memory and storage costs.
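As a rough illustration of the "less than 1%" figure, the sketch below counts LoRA adapter parameters for a decoder with Llama-3.2-1B-like shapes. The layer dimensions, layer count, total parameter count, rank, and the choice to adapt every linear projection are all assumptions made for illustration, not values read from NeMo AutoModel.

```python
# Rough count of LoRA trainable parameters versus total model parameters.
# Every shape below is an assumed, Llama-3.2-1B-like value for illustration.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

HIDDEN, KV_DIM, MLP_DIM = 2048, 512, 8192  # assumed dimensions
NUM_LAYERS = 16                            # assumed layer count
TOTAL_PARAMS = 1_240_000_000               # assumed, roughly 1.24B
RANK = 8                                   # assumed LoRA rank

# (d_in, d_out) for each adapted linear projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS
fraction = trainable / TOTAL_PARAMS

print(f"trainable adapter parameters: {trainable:,}")
print(f"fraction of total parameters: {fraction:.2%}")
```

Under these assumptions the adapters stay well below 1% of the model, and adapting fewer projections or lowering the rank shrinks the count further.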
Workflow Overview
Install NeMo AutoModel
Alternatively, if you run into dependency or driver issues, use the pre-built Docker container:
Docker containers are ephemeral: files written inside the container are lost when it stops. The -v flag in the docker run command above bind-mounts a local checkpoints/ directory into the container so that saved checkpoints persist across runs. For more details, see Save Checkpoints When Using Docker.
For the full set of installation methods, see the installation guide.
Configure Your Training Recipe
Training is configured through a YAML config file with three required sections (model, dataset, and step_scheduler) plus an optional peft section. The sections below walk through each one. For the complete copy-pastable file, see Full Config YAML.
Under the hood, both SFT and PEFT are executed by a recipe: a self-contained Python class that wires together model loading, dataset preparation, training, checkpointing, and logging. The fine-tuning recipe is TrainFinetuneRecipeForNextTokenPrediction. The config file tells the recipe what to build; the recipe decides how to build it.
How the Config System Works
NeMo AutoModel configs use a convention borrowed from Hydra: the special _target_ key tells the framework which Python class or function to call, and every other key in the same YAML block is passed as a keyword argument to that call. For example:
is equivalent to writing this Python code:
The _target_ value is a dotted Python import path: the same string you would use in an import statement. The framework resolves it at runtime by importing the module and looking up the attribute. This means you can point _target_ at any class constructor or factory function, and the remaining keys become its arguments.
To discover which parameters a section accepts, look up the Python signature of its _target_. For instance, torch.optim.Adam accepts lr, betas, eps, and weight_decay; those are the keys you can set in the YAML.
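One way to look up a signature without leaving Python is the standard-library inspect module. The sketch below uses a standard-library class as a stand-in target so it runs without PyTorch installed. The helper name accepted_keys is hypothetical and not part of NeMo AutoModel, and it only handles paths of the form module.attribute.

```python
import importlib
import inspect

def accepted_keys(target: str) -> list[str]:
    """List the parameter names of the callable at a dotted import path.

    Handles only "module.attribute" paths; a path that ends in a class
    method (Class.method) would need an extra attribute lookup.
    """
    module_path, _, attr = target.rpartition(".")
    obj = getattr(importlib.import_module(module_path), attr)
    return list(inspect.signature(obj).parameters)

# Stand-in target from the standard library. For an optimizer you would pass
# a path such as "torch.optim.Adam" instead (requires PyTorch).
keys = accepted_keys("textwrap.TextWrapper")
print(keys)
```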
From YAML to running code. Here is the path a config takes through the framework:
Each builder function calls .instantiate() on its config section. .instantiate() does two things:
- Resolves _target_: imports the Python path and obtains the callable (class or function).
- Calls it: passes every other key in the section as a keyword argument.
Nested _target_ blocks (like collate_fn inside dataloader) are recursively instantiated the same way.
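The resolve-then-call behavior, including recursion into nested blocks, can be sketched in a few lines. This is an illustration of the convention, not the framework's .instantiate() implementation: the resolve and instantiate helpers are hypothetical, and the config uses standard-library targets as stand-ins for model and dataset classes.

```python
import importlib
from typing import Any

def resolve(target: str) -> Any:
    """Import the longest importable module prefix, then walk attributes."""
    parts = target.split(".")
    for cut in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:cut]))
        except ModuleNotFoundError:
            continue  # prefix is not a module; try a shorter one
        for name in parts[cut:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"cannot resolve {target!r}")

def instantiate(node: Any) -> Any:
    """Recursively build objects from dicts that carry a _target_ key."""
    if isinstance(node, dict):
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        if "_target_" in node:
            return resolve(node["_target_"])(**kwargs)
        return kwargs  # plain key-value group, no object to build
    if isinstance(node, list):
        return [instantiate(v) for v in node]
    return node

# Stand-in config: a class, a class method, and a nested block.
config = {
    "_target_": "types.SimpleNamespace",
    "name": "demo",
    "ratio": {"_target_": "fractions.Fraction.from_float", "f": 0.5},
    "timeout": {"_target_": "datetime.timedelta", "minutes": 90},
}
built = instantiate(config)
print(built)
```

The class-method target here mirrors how a path ending in a factory method such as from_pretrained would be resolved: the module is imported first, then the class and the method are looked up as attributes.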
The recipe key. Every config file includes a top-level recipe key that tells the CLI which recipe class to run. You can write it as a short name or as a fully-qualified Python path; both resolve to the same class:
The short name form is a convenience: the CLI scans all recipe modules under nemo_automodel.recipes and matches the bare class name. If you invoke the recipe script directly with torchrun instead of the automodel CLI, the recipe key is not required because the script itself is the recipe.
Not every section uses _target_. Some sections like step_scheduler, distributed, and checkpoint are plain key-value groups consumed directly by the recipe; they control training schedule, parallelism strategy, and checkpoint behavior without instantiating a Python object.
Model
This guide uses Meta Llama 3.2 1B as a running example. Replace pretrained_model_name_or_path with any supported Hugging Face model ID.
About Llama 3.2 1B
Llama is a family of decoder-only transformer models developed by Meta. The 1B variant is a compact model suitable for research and edge deployment, featuring RoPE positional embeddings, grouped-query attention (GQA), and SwiGLU activations.
Accessing Gated Models
Some Hugging Face models are gated. If the model page shows a "Request access" button:
- Log in with your Hugging Face account and accept the license.
- Ensure the token you use (from huggingface-cli login or HF_TOKEN) belongs to the approved account.
Pulling a gated model without an authorized token triggers a 403 error.
Dataset
This guide uses SQuAD v1.1 as a running example. Swap the dataset by changing _target_ and the dataset arguments; see Integrate Your Own Text Dataset and Dataset Overview: LLM, VLM, and Retrieval Datasets.
About SQuAD v1.1
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and an answer span. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.
Example:
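The shape of a SQuAD-style record, and one possible way to turn it into a prompt and completion pair, can be sketched as follows. The record is a made-up illustration that follows the field layout of the Hugging Face squad dataset (context, question, answers). The format_example helper and its prompt template are hypothetical, and the template NeMo AutoModel actually applies may differ.

```python
# Made-up record in the field layout of the Hugging Face "squad" dataset.
record = {
    "id": "example-0",
    "title": "Example",
    "context": "The Eiffel Tower was completed in 1889 in Paris.",
    "question": "When was the Eiffel Tower completed?",
    "answers": {"text": ["1889"], "answer_start": [34]},
}

def format_example(rec: dict) -> tuple[str, str]:
    """Return (prompt, completion) built from one record (assumed template)."""
    prompt = f"Context: {rec['context']}\nQuestion: {rec['question']}\nAnswer:"
    completion = " " + rec["answers"]["text"][0]
    return prompt, completion

prompt, completion = format_example(record)

# The answer is a span of the context, located by its character offset.
start = record["answers"]["answer_start"][0]
span = record["context"][start:start + len(record["answers"]["text"][0])]

print(prompt + completion)
```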
PEFT (Optional)
Including a peft: section enables LoRA fine-tuning. Remove it entirely to run SFT instead; see Switch Between SFT and PEFT.
QLoRA (Quantized Low-Rank Adaptation)
If GPU memory is a constraint, QLoRA combines LoRA with 4-bit NormalFloat (NF4) quantization to reduce memory usage by up to 75% compared to full-parameter SFT in 16-bit precision, while maintaining comparable quality to standard LoRA.
To enable QLoRA, add a quantization: section alongside the peft: section in your config. Note two differences from the standard PEFT config above: target_modules uses the broader "*_proj" pattern to apply LoRA to all projection layers (wider coverage compensates for precision loss from 4-bit weights), and dim is increased from 8 to 16 for additional adapter capacity.
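The 75% figure matches simple weight-storage arithmetic: 4-bit weights take a quarter of the space of 16-bit weights. The sketch below works that out for an assumed 1.24B-parameter model. It counts base-model weights only and ignores quantization constants, adapters, activations, and optimizer state, so measured savings will differ.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_param / 8

NUM_PARAMS = 1_240_000_000  # assumed parameter count, roughly 1.24B

bf16_bytes = weight_bytes(NUM_PARAMS, 16)
nf4_bytes = weight_bytes(NUM_PARAMS, 4)
reduction = 1 - nf4_bytes / bf16_bytes

print(f"16-bit weights: {bf16_bytes / 2**30:.2f} GiB")
print(f"4-bit weights:  {nf4_bytes / 2**30:.2f} GiB")
print(f"reduction:      {reduction:.0%}")
```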
Training Schedule
Unlike the sections above, step_scheduler has no _target_: it is not instantiated into a Python object. Instead, the recipe reads its keys directly to control the training loop (how many epochs to run, when to checkpoint, when to validate). This is typical of sections that configure behavior rather than components.
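How step-based keys can drive a training loop is sketched below. This is not the recipe's implementation: the plan_events helper is hypothetical, val_every_steps is the key referenced later in this guide, ckpt_every_steps is an assumed name for the checkpoint interval, and saving a checkpoint at the final step is also an assumption.

```python
def plan_events(total_steps: int, val_every_steps: int,
                ckpt_every_steps: int) -> dict[int, list[str]]:
    """Map each 1-based step number to the actions triggered after it."""
    events: dict[int, list[str]] = {}
    for step in range(1, total_steps + 1):
        actions = []
        if step % val_every_steps == 0:
            actions.append("validate")
        # Assumption: always checkpoint at the final step as well.
        if step % ckpt_every_steps == 0 or step == total_steps:
            actions.append("checkpoint")
        if actions:
            events[step] = actions
    return events

events = plan_events(total_steps=50, val_every_steps=10, ckpt_every_steps=20)
for step, actions in events.items():
    print(step, actions)
```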
All other settings (distributed strategy, optimizer, checkpointing, logging) use sensible defaults. See the Full Configuration Reference to customize them.
Most example recipes use bf16 training by default for memory and throughput. If you are running long fine-tuning, especially full-parameter SFT, and need higher-precision optimizer state, configure it explicitly instead of assuming it from the mixed-precision compute policy. See the mixed-precision training guide for the recommended TE and torch AdamW patterns.
Full Config YAML
Save as finetune_config.yaml. This config runs PEFT (LoRA). To run SFT instead, remove the peft: section. For production-ready examples, see the hosted configs: Llama 3.2 1B SFT and Llama 3.2 1B PEFT.
Fine-Tune the Model
You can run the recipe using the AutoModel CLI or directly with torchrun (advanced).
The --nproc-per-node=8 flag specifies the number of GPUs per node. Adjust as needed (for a single GPU, omit the --nproc-per-node option).
Invoke the Recipe Script Directly (Advanced)
Alternatively, you can invoke the recipe script directly using torchrun, as shown below.
Sample Output
Running the recipe with the automodel app or by invoking the recipe script directly produces the following log:
Each log line reports the current loss, gradient norm, peak GPU memory, and tokens per second (TPS). Small fluctuations between steps (e.g., 0.2057 to 0.2557 above) are normal; look at the overall downward trend rather than individual values.
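One way to see the trend through step-to-step noise is to smooth the logged losses, for example with an exponential moving average. The loss values below are invented for illustration and are not taken from a real run, and the smoothing factor is an arbitrary choice.

```python
def ema(values: list[float], alpha: float = 0.3) -> list[float]:
    """Exponential moving average; the first value seeds the average."""
    smoothed: list[float] = []
    for value in values:
        previous = smoothed[-1] if smoothed else value
        smoothed.append(alpha * value + (1 - alpha) * previous)
    return smoothed

# Invented per-step losses: noisy, but trending downward.
losses = [1.20, 0.85, 0.90, 0.60, 0.66, 0.45, 0.50, 0.38]
smoothed = ema(losses)

raw_upticks = sum(b > a for a, b in zip(losses, losses[1:]))
smoothed_upticks = sum(b > a for a, b in zip(smoothed, smoothed[1:]))
print(f"raw upticks: {raw_upticks}, smoothed upticks: {smoothed_upticks}")
```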
Checkpoint Contents
Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated model/consolidate.sh helper; run the helper after training to create model/consolidated/ for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set save_consolidated: final to export consolidated HF weights only for the final checkpoint. Use save_consolidated: every (or legacy true) only when you intentionally want inline HF export at every checkpoint save. PEFT checkpoints contain only the adapter weights (~MBs instead of GBs) and are saved directly under model/. They do not use model/consolidate.sh; at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).
Checkpoint Directory Structure
SFT checkpoint:
PEFT checkpoint:
Run Inference
Inference uses the Hugging Face generate API. Because exported SFT checkpoints are self-contained while PEFT checkpoints store only adapter weights (see Checkpoint Contents), the loading procedure differs between the two modes.
SFT Inference
If save_consolidated: false, first run the generated helper for the checkpoint you want to load:
The exported SFT checkpoint at model/consolidated/ is a complete Hugging Face model and can be loaded directly:
PEFT Inference
PEFT adapters must be loaded on top of the base model:
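The two loading paths can be sketched with the Hugging Face transformers and peft libraries. The helper names, the directory arguments, and the check for an adapter_config.json file (the file name peft uses for saved adapter configs) are assumptions for illustration. Heavy imports are deferred into the functions so the file-detection helper can be tried without a GPU or a model download.

```python
from pathlib import Path
from typing import Optional

def is_peft_checkpoint(model_dir: str) -> bool:
    """Assumption: a PEFT checkpoint is marked by an adapter_config.json file."""
    return (Path(model_dir) / "adapter_config.json").is_file()

def load_model(model_dir: str, base_model_id: Optional[str] = None):
    """Load a consolidated SFT checkpoint, or a base model plus its adapter."""
    from transformers import AutoModelForCausalLM  # deferred heavy import

    if not is_peft_checkpoint(model_dir):
        # SFT: the consolidated directory is a complete Hugging Face model.
        return AutoModelForCausalLM.from_pretrained(model_dir)

    if base_model_id is None:
        raise ValueError("a PEFT adapter needs the original base model ID")
    from peft import PeftModel  # deferred heavy import

    # PEFT: load the base weights first, then apply the adapter on top.
    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    return PeftModel.from_pretrained(base, model_dir)

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Run the Hugging Face generate API on a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In this sketch, the tokenizer passed to generate would typically come from AutoTokenizer.from_pretrained, pointed at the consolidated directory for SFT or at the base model for PEFT.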
Evaluate the Fine-Tuned Model
During Training: Validation Loss
The recipe automatically computes validation loss at the interval set by val_every_steps. Look for [val] lines in the training log:
A decreasing validation loss across checkpoints indicates the model is learning. If validation loss plateaus or increases while training loss continues to drop, the model may be overfitting; consider stopping earlier or reducing the learning rate.
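The "plateaus or increases" check can be made mechanical with a patience rule, sketched below. The should_stop helper is hypothetical and not a NeMo AutoModel feature, the default patience is an arbitrary choice, and the loss values are invented for illustration.

```python
def should_stop(val_losses: list[float], patience: int = 2,
                min_delta: float = 0.0) -> bool:
    """True when the last `patience` evaluations all failed to improve on
    the best validation loss seen before them by more than `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)

improving = [1.10, 0.95, 0.88, 0.84]    # invented: still decreasing
overfitting = [1.10, 0.95, 0.97, 1.02]  # invented: rising after step two
plateau = [1.00, 0.90, 0.90, 0.90]      # invented: flat

print(should_stop(improving), should_stop(overfitting), should_stop(plateau))
```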
After Training: lm-eval-harness
For task-specific benchmarks (e.g., MMLU, GSM8K, HellaSwag accuracy), use lm-eval-harness with the fine-tuned checkpoint. For SFT runs with save_consolidated: false, run bash checkpoints/epoch_0_step_20/model/consolidate.sh before pointing evaluation at model/consolidated/:
The SFT example uses the vllm backend for faster evaluation (requires pip install vllm; see Deploy with vLLM for setup details). The PEFT example uses the hf backend with lm-eval's built-in PEFT support to load the adapter on top of the base model.
Run lm-eval-harness on the base model before fine-tuning to establish a baseline, then compare against the fine-tuned checkpoint.
Publish to the Hugging Face Hub
Fine-tuned checkpoints and PEFT adapters are stored in Hugging Face-native format and can be uploaded directly to the Hub. For SFT runs with save_consolidated: false, upload model/consolidated/ after running the generated consolidation helper.
- Install the Hugging Face Hub library (if not already installed):
- Log in to Hugging Face:
- Upload:
SFT checkpoint:
PEFT adapter:
Once uploaded, load the checkpoint or adapter directly from the Hub:
SFT:
PEFT:
Deploy with vLLM
vLLM is an efficient inference engine for production deployment of LLMs.
Make sure vLLM is installed (pip install vllm, or use an environment that includes it).
SFT Checkpoint with vLLM
If save_consolidated: false, run the generated model/consolidate.sh helper before serving from model/consolidated/.
PEFT Adapter with vLLM
PEFT adapter serving uses the vLLMHFExporter class, which is provided by the nemo package, a separate dependency from nemo-automodel.
Install both packages before proceeding:
Full Configuration Reference
This section documents all available config fields for the fine-tuning recipe. For the quick-start config, see Configure Your Training Recipe.
Switch Between SFT and PEFT
The peft: section controls which mode runs:
All other config sections remain the same for both modes.
Full Configuration
Full Config
Config Field Reference
For the fine-tuning recipe itself, see train_ft.py. For more example configs, browse examples/llm_finetune/.
Distributed Training: TP, PP, CP, and EP
The distributed: section controls how the model and data are split across GPUs. NeMo AutoModel supports four parallelism dimensions, each of which slices the workload differently:
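These dimensions compose with each other. The relationship between them and total GPU count is world_size = dp_size × tp_size × pp_size × cp_size.
When dp_size is set to null (the default), it is inferred automatically as dp_size = world_size / (tp_size × pp_size × cp_size).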
EP does not appear in this formula: experts are distributed across the DP×CP rank groups, with the constraint that (dp_size × cp_size) must be divisible by ep_size.
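The sizing rules for these dimensions can be sketched as a small validation helper. It assumes world_size = dp_size × tp_size × pp_size × cp_size, as implied by the composition rules in this section. The function name is hypothetical, and this is not the framework's own inference code.

```python
from typing import Optional

def infer_dp_size(world_size: int, tp_size: int = 1, pp_size: int = 1,
                  cp_size: int = 1, ep_size: int = 1,
                  dp_size: Optional[int] = None) -> int:
    """Infer dp_size from the GPU count and check the EP constraint."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp*pp*cp={model_parallel}"
        )
    inferred = world_size // model_parallel
    if dp_size is not None and dp_size != inferred:
        raise ValueError(f"dp_size={dp_size} conflicts with inferred {inferred}")
    # EP is not part of the product; it must divide the DP x CP group size.
    if (inferred * cp_size) % ep_size != 0:
        raise ValueError(
            f"dp_size*cp_size={inferred * cp_size} is not divisible by "
            f"ep_size={ep_size}"
        )
    return inferred

print(infer_dp_size(64, tp_size=2, pp_size=4))             # 64 / (2*4*1)
print(infer_dp_size(16, tp_size=2, cp_size=2, ep_size=8))  # 16 / (2*1*2)
```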
Data Parallel (default)
Data parallelism is the default. With strategy: fsdp2, FSDP2 shards both model parameters and optimizer states across the DP group, so memory usage shrinks as you add GPUs:
Tensor Parallelism
TP splits weight matrices across GPUs within a single node. Set tp_size to the number of GPUs you want to shard over (typically 2, 4, or 8; it should divide evenly into the number of attention heads):
sequence_parallel: true extends TP to also shard activation memory along the sequence dimension, further reducing per-GPU memory at the cost of additional communication.
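The divisibility guidance for tp_size can be checked with a short helper. The function is hypothetical, and the extra requirement that tp_size also divides the GPUs per node is an assumption, made so that TP groups fit within a node.

```python
def valid_tp_sizes(num_attention_heads: int, gpus_per_node: int = 8) -> list[int]:
    """List tp_size values that divide both the head count and the node size."""
    return [
        tp for tp in range(1, gpus_per_node + 1)
        if num_attention_heads % tp == 0 and gpus_per_node % tp == 0
    ]

print(valid_tp_sizes(32))  # head count assumed for illustration
print(valid_tp_sizes(12))
```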
Pipeline Parallelism
PP assigns groups of layers to different GPUs and streams micro-batches through the stages. It requires an additional nested pipeline: section:
PP requires the model to define a _pp_plan that tells the framework how to split layers into stages. All built-in models include this plan; custom models must add one.
Context Parallelism
CP splits the sequence across GPUs, which is useful for very long contexts that exceed single-GPU memory. Set cp_size to the desired split factor:
When cp_size > 1, fused RoPE is automatically disabled. Some models also require the Transformer Engine (TE) attention backend for CP with packed sequences; the framework will raise an error with instructions if this applies.
Expert Parallelism (MoE models)
EP distributes MoE experts across GPUs. Set ep_size to the number of GPUs that share the full set of experts:
EP only applies to Mixture-of-Experts models (e.g. Qwen3-MoE, Mixtral, DeepSeek-V3). For dense models, leave ep_size at 1 or omit it.
Combine Multiple Dimensions
You can combine TP, PP, CP, and EP in a single config. For example, a large MoE model on a multi-node cluster might use:
When choosing a combination, keep these rules in mind:
- world_size must be evenly divisible by pp_size × tp_size × cp_size (the quotient becomes dp_size).
- (dp_size × cp_size) % ep_size == 0: EP shares the DP×CP groups.
- TP within a node, PP across nodes is the typical layout: TP requires fast NVLink bandwidth, while PP tolerates higher latency.
- Start simple. Use DP-only first. Add TP if the model doesn't fit on one GPU. Add PP for very large models. Add CP for long sequences. Add EP only for MoE architectures.
Next Steps
- Integrate Your Own Text Dataset: swap the SQuAD example for your own data.
- Recipes and End-to-End Examples: browse the full set of recipes available in NeMo AutoModel. See also the examples/llm_finetune/ directory for ready-to-run configs.
- Dataset Overview: LLM, VLM, and Retrieval Datasets: see all supported dataset types across LLM, VLM, and retrieval tasks.
- Knowledge Distillation: distill a fine-tuned model into a smaller one.