Customization Concepts#

This page provides an overview of the customization concepts for the NeMo microservices platform.

Supervised Fine-Tuning#

Supervised fine-tuning (SFT) is a traditional technique for customizing a pre-trained model on labeled datasets for a specific task. It enhances the model’s performance on the task by updating the model’s weights. This traditional SFT technique trains and stores the entire model, which can be memory-intensive and time-consuming.

Parameter-Efficient Fine-Tuning#

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient model customization by training a small number of parameters while keeping the base model frozen. For example, when customizing LLaMa 3.3 70B:

Traditional SFT: Trains and stores ~40 GB per task.
PEFT: Trains and stores only a few MB per task while maintaining comparable performance.

flowchart TD T1[Task 1] --> M1[LLaMa 3.3 - 70B] T2[Task 2] --> M2[LLaMa 3.3 - 70B] T3[Task 3] --> M3[LLaMa 3.3 - 70B] style M1 fill:#B8D5F2 style M2 fill:#B8D5F2 style M3 fill:#B8D5F2

Traditional Fine-Tuning#

flowchart TD P1[Task 1] --> A1[LoRA 113M] P2[Task 2] --> A2[LoRA 20M] P3[Task 3] --> A3[LoRA 10M] A1 & A2 & A3 --> M[LLaMa 3.3 - 70B] style A1 fill:#B8D5F2 style A2 fill:#B8D5F2 style A3 fill:#B8D5F2 style M fill:#90EE90

Parameter-Efficient Fine Tuning#

NeMo Customizer supports supervised fine-tuning (SFT) with the PEFT method of Low-Rank Adaptation (LoRA).

Fine-tuning with LoRA is the recommended starting point for most use cases.

LoRA#

In comparison to full fine-tuning of a base model, Low-Rank Adaptation (LoRA) has the following advantages:

Updates only a small portion of the base model’s weights
Comparable performance to full fine-tuning of a base model
LoRA is more memory and storage efficient than full fine-tuning

LoRA high level architecture:

flowchart LR API[Developer API] --> Input[("Input")] Input --> FW[("Frozen pretrained weights (W)")] Input --> RD[("Low-rank matrices (ΔW)")] FW & RD --> Add{{"W + ΔW"}} Add --> CM[Custom Model] classDef input fill:#fff,stroke:#68B749,stroke-width:2px,rx:15 classDef neural fill:#fff,stroke:#000,stroke-width:1px classDef add fill:#fff,stroke:#000,stroke-width:1px,rx:10 class Input,API input class FW,RD,CM neural class Add add

Low Rank Adaptation (LoRA)#

During LoRA training:

The original model’s pretrained weights (W) remain frozen
Small, trainable low-rank matrices are added to approximate weight updates (ΔW)
These updates are applied to the query and value matrices in the model’s attention layers
The final model combines the original weights with these updates (W + ΔW)

This approach:

Requires training only a fraction of the parameters
Maintains model quality comparable to full fine-tuning
Reduces memory usage and training time
Adds minimal latency during inference

Additional Resources:

Training with Your Own Data#

Use NeMo Customizer to train custom models on your own data. The workflow can be carried out as follows:

Upload a dataset
Train a custom model
Perform inference with the trained model

Truncating Long Dataset Samples#

Long samples in the dataset are truncated during training if the total token length exceeds the context supported by the model.

Note

Refer to the /v1/customization/configs response for a given configuration to see the max_length for that model.

Dataset Type	Token Counting	Length Management
Prompt Completion	• Total = prompt + completion tokens	• Truncates prompt tokens to fit limits • Filters out entries that still exceed maximum length
Conversational	• Total = conversation turns + template tokens • Templates are model-specific	• Truncates tokens beyond maximum limit • Preserves template formatting

Prompt Completion Datasets#

Below are some examples of how you might format your dataset to perform a handful of different tasks.

Note

When testing models trained with prompt/completion datasets, use the /v1/completions endpoint instead of /v1/chat/completions.

For details, refer to the Dataset Formatting tutorial.

Document Classification#

Classify a document into predefined categories. Each training example consists of:

A document to classify
Its corresponding class label

Format:

prompt: "Classify this document into one of the following classes: [class label 1, class label 2, class label 3]. Only specify one label per document.\n\n<document>\nClass: "
completion: "<class label>"

Extractive Q&A#

Extract an answer from a given context in response to a question. Each training example consists of:

A question to answer
A context passage containing the answer
The extracted answer

Format:

prompt: "<Question> context: <context> answer: "
completion: "<Answer>"

Simplification#

Simplify complex text into a clearer version. Each training example consists of:

A complex sentence or paragraph
Its simplified version

Format:

prompt: "Simplify the following sentence:\n<complex>\nsimple: "
completion: "<simple>"

Conversational Datasets#

Most of the models support Instruction Templates for training, the expected dataset conforms with the standard OpenAI messages format. Additionally, some models support tool calling which have additional optional parameters of tools at the top level of each entry and tool_calls per message.

For more information refer to our in-depth instructions.

Hyperparameters#

Hyperparameters are configuration settings used to control the training process. You’ll set these values before training begins to optimize how the model learns from your data. While the model automatically learns its internal parameters during training, these hyperparameters help guide that learning process. The right values depend on your specific use case, dataset size, and computational resources.

Parameter	Description	Recommended Value
Batch Size	Number of samples processed together before updating model parameters. Higher values improve efficiency but require more memory.	`16` (recommended; use 32 only if you have validated stability and sufficient memory)
Learning Rate	Step size for updating model parameters. Controls how much to adjust weights based on gradients.	`1e-3` to `1e-5`
Number of Epochs	Number of complete passes through the training dataset. Customizer performs automatic early stopping, halting training if there has been no improvement in validation score for ten epochs.	Varies by dataset size
Weight Decay	Regularization parameter to prevent overfitting by penalizing large weights.	`0.01` (default)
Adapter Dimension (LoRA)	Rank of the low-rank matrices. Affects model capacity and performance. The dimension of the adpater directly correlates with the number of parameters trained, higher dimensions require substantially more data to train.	Start with `8`, increase if needed
Adapter Alpha (LoRA)	Scaling factor for the LoRA update. Controls the magnitude of the low-rank approximation. A higher alpha value increases the impact of the LoRA weights, effectively amplifying the changes made to the original model.	Start with `16`, increase if model isn’t changing quickly enough, decrease if the model is overfitting or increasing in validation loss during training.
Adapter Dropout (LoRA)	Probability of dropping neurons during training to prevent overfitting.	`0.05` for smaller models (7B-13B)
Sequence Packing Enabled	Enables sequence packing in the job to optimize token/GPU throughput. Note: this is an experimental feature.	`false` by default

Parallelism#

NeMo Microservices Customizer supports various distributed training parallelization methods, which can be mixed together.

Tensor Parallelism#

Tensor Parallelism (TP) distributes the parameter tensor of an individual layer across GPUs. In addition to reducing model state memory usage, it also saves activation memory as the per-GPU tensor sizes shrink. The tradeoff is increased CPU overhead.

TP can be configured via tensor_parallel_size in the Customization Config.

Note

As of release 25.10.0, AutoModel engines including Phi-4, Qwen, and Gemma support tensor parallelism greater than 1 through the multi-GPU LoRA patch. Previous releases only supported TP=1 for these models.

Pipeline Parallelism#

Pipeline Parallelism (PP) distributes the layers of a neural network across GPUs. The GPUs then process the different layers sequentially.

PP can be configured via pipeline_parallel_size in the Customization Config.

Configuration#

Constraints
- TP must be less than or equal to the total number of GPUs available. It should be a factor of the total GPU count (divisible evenly).
Multi-node considerations
- TP can span across nodes, but this introduces network communication overhead. For multi-node setups, it’s often recommended to keep TP within a single node when possible. If using TP across nodes, high-bandwidth inter-node connections (like InfiniBand) become critical.
Example: if you have 2 nodes with 4 GPUs each, start with TP=4 first. This keeps all tensor parallel operations within a single node. If your model still uses too much GPU memory with this setting, increase to TP=8, which will distribute tensor operations across both nodes.
Performance
- Smaller TP values generally have less communication overhead.
- Larger TP values provide more memory savings but increase communication costs.

Sequence Parallelism#

Sequence Parallelism (SP) extends tensor-level model parallelism by distributing computing load and activation memory across multiple GPUs along the sequence dimension of transformer layers. This method is particularly useful when training on the datasets with longer sequences. It also benefits portions of the layer that have previously not been parallelized, enhancing overall model performance and efficiency.

Sequence Parallelism can be enabled/disabled in the Customization Config using the use_sequence_parallel key.

Sequence Packing#

Sequence packing is an efficient training optimization that combines multiple training examples into a single, longer sequence (called a pack). By eliminating the need for padding between sequences, this technique helps you:

Process more tokens in each micro batch
Maximize GPU compute efficiency
Optimize GPU memory usage

When enabled, the batch_size and number of training steps update so that each gradient iteration sees, on average, the same number of tokens compared to running fine-tuning without sequence packing.

Limitations#

Sequence packing is an experimental feature only supprted by the following models:
- meta/llama-3.1-8b-instruct
- meta/llama-3.1-70b-instruct
- meta/llama3-70b-instruct
- meta/llama-3.2-3b-instruct
- meta/llama-3.2-1b
- meta/llama-3.2-1b-instruct
Chat prompt templates do not have support for sequence packing.

Note

If hyperparameters.sequence_packing_enabled flag is enabled when using a model that does not support sequence packing, the fine-tuning will proceed without sequence packing and a warning will be returned in the API response.

Example of using in the API#

Example of creating a customization job with sequence packing enabled:

curl --location \
"https://${CUST_HOSTNAME}/v1/customization/jobs" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {
        "name": "test-dataset"
    },
    "hyperparameters": {
        "sequence_packing_enabled": true,
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 10,
        "batch_size": 16,
        "learning_rate": 0.00001,
        "lora": {
            "adapter_dim": 16
        }
    }
}' | jq

Learn how to create a LoRA customization job with sequence packing by following the Optimizing for Tokens/GPU tutorial.