Customization Concepts | NVIDIA NeMo Platform

This page provides an overview of the customization concepts for the NeMo Platform.

Supervised Fine-Tuning

Supervised fine-tuning (SFT) is a traditional technique for customizing a pre-trained model on labeled datasets for a specific task. It enhances the model’s performance on the task by updating the model’s weights. This traditional SFT technique trains and stores the entire model, which can be memory-intensive and time-consuming.

Deploying Full SFT Models

Full SFT models require a NIM deployment to serve inference. The Deployment Management Service supports two deployment modes:

Deployment Mode	Image Type	Weight Loading	Best For
Multi-LLM (Default)	Generic multi-model NIM	On-the-fly download via Files service	Supported Hugging Face architectures, custom fine-tuned models, development
Model-Specific NIM	Dedicated model image	Pre-download via model puller	Production, optimized performance and latency

Multi-LLM Image: Can deploy Hugging Face checkpoints whose architectures are supported by the image’s inference engine, providing flexibility for custom fine-tuned models. Importing a checkpoint does not guarantee training or deployment compatibility; for example, Automodel LoRA does not support Conv1D-based architectures. It also does not guarantee optimized inference performance.
Model-Specific NIM: Provides optimized inference performance and latency through model-specific optimizations. Recommended for production deployments where performance is critical.

Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient model customization by training a small number of parameters while keeping the base model frozen. For example, when customizing Llama 3.3 70B:

Traditional SFT: Produces a full BF16 checkpoint of approximately 140 GB per task. During training, budget free disk space separately for the base checkpoint, intermediate checkpoint, and final output—approximately 3× the downloaded base checkpoint size.
PEFT: Produces an adapter that is approximately 100–500 MB per task while maintaining comparable performance. During training, budget approximately 1.5× the downloaded base checkpoint size.

NeMo Customizer supports supervised fine-tuning (SFT) with the PEFT method of Low-Rank Adaptation (LoRA).

Fine-tuning with LoRA is the recommended starting point for most use cases.

LoRA

In comparison to full fine-tuning of a base model, Low-Rank Adaptation (LoRA) has the following advantages:

Updates only a small portion of the base model’s weights
Comparable performance to full fine-tuning of a base model
LoRA is more memory and storage efficient than full fine-tuning

LoRA high level architecture:

During LoRA training:

The original model’s pretrained weights (W) remain frozen
Small, trainable low-rank matrices are added to approximate weight updates (ΔW)
These updates are applied to the query and value matrices in the model’s attention layers
The final model combines the original weights with these updates (W + ΔW)

This approach:

Requires training only a fraction of the parameters
Maintains model quality comparable to full fine-tuning
Reduces memory usage and training time
Adds minimal latency during inference

Additional Resources:

Training with Your Own Data

Use NeMo Customizer to train custom models on your own data. The workflow can be carried out as follows:

Upload a dataset
Train a custom model
Perform inference with the trained model

Truncating Long Dataset Samples

Long samples in the dataset are truncated during training if the total token length exceeds the context supported by the model.

Refer to the model’s documentation to see the maximum supported sequence length.

Dataset Type	Token Counting	Length Management
Prompt Completion	• Total = prompt + completion tokens	• Truncates prompt tokens to fit limits • Filters out entries that still exceed maximum length
Conversational	• Total = conversation turns + template tokens • Templates are model-specific	• Truncates tokens beyond maximum limit • Preserves template formatting

Prompt Completion Datasets

Below are some examples of how you might format your dataset to perform a handful of different tasks.

When testing models trained with prompt/completion datasets, use the /v1/completions endpoint instead of /v1/chat/completions.

For details, refer to the Dataset Formatting tutorial.

Document Classification

Classify a document into predefined categories. Each training example consists of:

A document to classify
Its corresponding class label

Format:

prompt: "Classify this document into one of the following classes: [class label 1, class label 2, class label 3]. Only specify one label per document.\n\n<document>\nClass: "
completion: "<class label>"

Extractive Q&A

Extract an answer from a given context in response to a question. Each training example consists of:

A question to answer
A context passage containing the answer
The extracted answer

Format:

prompt: "<Question> context: <context> answer: "
completion: "<Answer>"

Simplification

Simplify complex text into a clearer version. Each training example consists of:

A complex sentence or paragraph
Its simplified version

Format:

prompt: "Simplify the following sentence:\n<complex>\nsimple: "
completion: "<simple>"

Conversational Datasets

Most of the models support Instruction Templates for training, the expected dataset conforms with the standard OpenAI messages format. Additionally, some models support tool calling which have additional optional parameters of tools at the top level of each entry and tool_calls per message.

For more information refer to our in-depth instructions.

Hyperparameters

Hyperparameters are configuration settings used to control the training process. You’ll set these values before training begins to optimize how the model learns from your data. While the model automatically learns its internal parameters during training, these hyperparameters help guide that learning process. The right values depend on your specific use case, dataset size, and computational resources.

Common hyperparameters you’ll tune include:

Hyperparameter	Description
Epochs	Number of complete passes through the training dataset
Batch size	Number of samples processed before updating model weights
Learning rate	Step size for weight updates during training
LoRA rank	Low-rank dimension of the adapter (lower = fewer parameters, higher = more expressive)
LoRA alpha	LoRA scaling factor

NeMo Customizer offers two training backends — Automodel (multi-GPU) and Unsloth (single-GPU, with optional quantized loading for LoRA) — and each accepts its own job configuration. Unsloth full-weight training requires unquantized model loading. The exact field names, defaults, and available knobs differ between them. For the full per-backend hyperparameter reference, see Training Configuration.

Parallelism

The Automodel backend supports several distributed training parallelization methods, which can be mixed together. The Unsloth backend runs on a single GPU and does not use these settings.

Tensor Parallelism

Tensor Parallelism (TP) distributes the parameter tensor of an individual layer across GPUs. In addition to reducing model state memory usage, it also saves activation memory as the per-GPU tensor sizes shrink. The tradeoff is increased CPU overhead.

TP can be configured via parallelism.tensor_parallel_size in the training configuration.

As of release 25.10.0, AutoModel engines including Phi-4, Qwen, and Gemma support tensor parallelism greater than 1 through the multi-GPU LoRA patch. Previous releases only supported TP=1 for these models.

Tensor Parallelism Configuration

Constraints

TP must be less than or equal to the total number of GPUs available.
TP should divide the total GPU count evenly.

Multi-node considerations

TP can span nodes, but doing so increases network communication overhead. For multi-node setups, keep TP within a single node when possible. High-bandwidth inter-node connections such as InfiniBand are important when TP must span nodes.

For example, with 2 nodes and 4 GPUs per node, start with TP=4 to keep tensor-parallel operations within each node. If the model still requires more memory, increase to TP=8 to distribute tensor operations across both nodes.

Performance

Smaller TP values generally have less communication overhead.
Larger TP values provide more memory savings but increase communication costs.

Pipeline Parallelism

Pipeline Parallelism (PP) distributes the layers of a neural network across GPUs. The GPUs then process the different layers sequentially.

PP can be configured via parallelism.pipeline_parallel_size in the training configuration.

Context Parallelism

Context Parallelism (CP) distributes activation memory along the sequence dimension across GPUs, which is particularly useful when training on datasets with very long sequences.

Context Parallelism can be configured via parallelism.context_parallel_size in the training configuration.

Sequence Packing

Sequence packing is an efficient training optimization that combines multiple training examples into a single, longer sequence (called a pack). By eliminating the need for padding between sequences, this technique helps you:

Process more tokens in each micro batch
Maximize GPU compute efficiency
Optimize GPU memory usage

When enabled, the effective batch size and number of training steps update so that each gradient iteration sees, on average, the same number of tokens compared to running fine-tuning without sequence packing.

Sequence packing is enabled per backend:

Automodel: set batch.sequence_packing to true.
Unsloth: set dataset.packing to true.

See Training Configuration for the full batch and dataset options.

Limitations

Sequence packing is an experimental feature whose support varies by model and backend.
Chat prompt templates do not have support for sequence packing.

If sequence packing is enabled for a model that does not support it, fine-tuning proceeds without sequence packing and a warning is returned in the API response.

Learn how to create a LoRA customization job with sequence packing by following the Optimizing for Tokens/GPU tutorial.

Next Steps

Review all backend-specific options in Training Configuration.
Create a LoRA customization job.
Create a Full SFT customization job.
Optimize training throughput with sequence packing.