Training Configuration#
Tip
Want to learn about training concepts at a high level? Check out the Customization concepts page.
Training is configured in the job's `spec.training` object.
Quick Reference#
The training field is a discriminated union on the type field. Each training method inherits common hyperparameters and adds method-specific fields.
Training Method#
| Parameter | Values | Description |
|---|---|---|
| `type` | `"sft"`, `"dpo"`, `"distillation"` | Training method (discriminated union) |
| `peft` | object (optional) | PEFT adapter configuration. If set, trains an adapter; if omitted, performs full-weight training |
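For orientation, a minimal sketch of the two shapes (values are illustrative; the full hyperparameter set is shown in the examples later on this page):

```
# SFT training a LoRA adapter (omit "peft" for full-weight training)
"training": {
  "type": "sft",
  "peft": { "type": "lora" }
}

# DPO does not yet support PEFT, so no "peft" block
"training": { "type": "dpo" }
```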
DPO Configuration#
When training.type is "dpo", additional DPO-specific fields are available:
| Parameter | Description | Recommended Values |
|---|---|---|
| `ref_policy_kl_penalty` | KL divergence penalty (beta) | `0.05` (default) |
| `preference_average_log_probs` | Average log probabilities for preference loss | `false` (default) |
| `sft_average_log_probs` | Average log probabilities for SFT loss | `false` (default) |
| `preference_loss_weight` | Weight for preference loss | `1.0` (default) |
| `sft_loss_weight` | Weight for SFT loss | `0.0` (default) |
Note
PEFT (LoRA) is not yet supported with DPO training. Use full-weight training by omitting the peft field.
Tip
When setting val_check_interval for DPO, use a fractional value (e.g., 0.5 for twice per epoch) or omit it entirely (validates once at end of epoch). Avoid integer step counts — they may not divide evenly into the total training steps, which can prevent validation from running on the final step.
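The failure mode is easy to see with hypothetical numbers (103 total steps is an assumption for illustration):

```python
def validation_steps(total_steps: int, interval: int) -> list[int]:
    """Steps at which validation fires when checking every `interval` optimizer steps."""
    return list(range(interval, total_steps + 1, interval))

# 103 steps with an integer interval of 25: the final step is never validated.
print(validation_steps(103, 25))  # [25, 50, 75, 100]
```

A fractional value like `0.5` ties validation to epoch boundaries instead of a fixed step count, so the end of training is covered.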
Parallelism Configuration#
Parallelism parameters are grouped inside training.parallelism:
| Parameter | Description | Notes |
|---|---|---|
| `num_gpus_per_node` | Number of GPUs per node | Default: |
| `num_nodes` | Number of training nodes | Use `1` unless multi-node setup |
| `tensor_parallel_size` | GPUs for tensor parallelism | Split layers across GPUs (for large models) |
| `pipeline_parallel_size` | GPUs for pipeline parallelism | Split model stages across GPUs |
| `context_parallel_size` | GPUs for context parallelism | For very long sequences |
| `expert_parallel_size` | Expert parallelism for MoE models | Must divide number of experts |
| `sequence_parallel` | Enable sequence parallelism | Memory optimization for long sequences |
Note
GPU Relationship: `total_gpus = num_gpus_per_node × num_nodes`
`data_parallel_size` is automatically derived as `total_gpus / (TP × PP × CP)`.
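A quick sketch of that derivation (parameter names mirror the table above; the division must come out exact, or the configuration is invalid):

```python
def derived_data_parallel_size(num_gpus_per_node: int, num_nodes: int,
                               tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """data_parallel_size = total_gpus / (TP * PP * CP)."""
    total_gpus = num_gpus_per_node * num_nodes
    model_parallel = tp * pp * cp
    if total_gpus % model_parallel:
        raise ValueError(f"TP*PP*CP={model_parallel} must divide total_gpus={total_gpus}")
    return total_gpus // model_parallel

# 2 nodes x 8 GPUs = 16 GPUs; TP=4, PP=2 leaves a data-parallel size of 2.
print(derived_data_parallel_size(8, 2, tp=4, pp=2))  # 2
```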
PEFT / LoRA Configuration#
To train a LoRA adapter, set training.peft:
| Parameter | Description | Recommended Values |
|---|---|---|
| `type` | PEFT method type | `"lora"` |
| `rank` | LoRA rank (low-rank dimension) | `8` (default) |
| `alpha` | LoRA alpha scaling factor | `32` (default) |
| `dropout` | LoRA dropout probability | `0.0` (default) |
| `target_modules` | Module patterns to apply LoRA to | e.g. `["*.q_proj", "*.v_proj"]` |
| `merge_weights` | Merge LoRA weights into base model | `false` (default) |
| `use_dora` | Enable DoRA (Weight-Decomposed Low-Rank Adaptation) | `false` (default) |
```
"training": {
  "type": "sft",
  "peft": {
    "type": "lora",
    "rank": 8,
    "alpha": 32,
    "dropout": 0.0,
    "target_modules": ["*.q_proj", "*.v_proj"]  # Optional: specific modules
  }
}
```
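Two properties of this configuration are worth sanity-checking: the effective update scale is `alpha / rank` (32 / 8 = 4 here), and the adapter is tiny next to the frozen weights. A rough count, assuming a 4096 × 4096 projection matrix (typical of 7-8B attention layers; the shape is an illustration, not part of the schema):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * d_in + d_out * rank

base = 4096 * 4096                          # one frozen projection matrix
adapter = lora_param_count(4096, 4096, rank=8)
print(adapter, f"{adapter / base:.2%}")     # 65536 0.39%
```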
Distillation Configuration#
Note
Knowledge distillation is not yet supported.
Common Tuning Scenarios#
Loss Not Decreasing (Underfitting)#
```
"training": {
  "type": "sft",
  "peft": {"type": "lora"},
  "epochs": 5,             # Increase from 3
  "learning_rate": 0.0001, # Increase from 5e-5
  "warmup_steps": 50       # Add warmup
}
```
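The `warmup_steps` knob ramps the learning rate up from zero, which keeps the higher learning rate from destabilizing early updates. A minimal sketch of linear warmup (the service's actual scheduler may differ):

```python
def lr_at(step: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)
    return base_lr

print(lr_at(25, 50, 1e-4))   # 5e-05 (halfway through warmup)
print(lr_at(200, 50, 1e-4))  # 0.0001
```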
Out of Memory (OOM) Errors#
```
"training": {
  "type": "sft",
  "peft": {"type": "lora"},
  "batch_size": 8,        # Reduce from 32
  "micro_batch_size": 1,  # Reduce from 2
  "max_seq_length": 1024  # Reduce from 2048
}
```
Overfitting (Validation Loss Increasing)#
```
"training": {
  "type": "sft",
  "peft": {
    "type": "lora",
    "rank": 8,
    "dropout": 0.1          # Add dropout
  },
  "epochs": 2,              # Reduce from 5
  "learning_rate": 0.00002, # Lower to 2e-5
  "weight_decay": 0.01
}
```
GPU Memory Guidelines#
Estimated GPU requirements by model size:
| Model Size | LoRA (min GPUs) | Full FT (min GPUs) |
|---|---|---|
| 1B | 1 × 16GB | 1 × 24GB |
| 3B | 1 × 24GB | 2 × 24GB |
| 7-8B | 1 × 40GB | 2-4 × 80GB |
| 13B | 1 × 80GB | 4 × 80GB |
| 70B | 2 × 80GB | 8+ × 80GB |
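These figures line up with a back-of-the-envelope estimate (assumptions: bf16 training with Adam, roughly 16 bytes/parameter for full fine-tuning, frozen bf16 weights plus a small assumed overhead for LoRA; activation memory is ignored, which is why real LoRA requirements such as the 40GB figure run higher):

```python
def full_ft_gib(params_billion: float) -> float:
    """bf16 weights (2 B) + grads (2 B) + Adam moments and fp32 master copy (~12 B)."""
    return params_billion * 1e9 * 16 / 2**30

def lora_gib(params_billion: float) -> float:
    """Frozen bf16 weights (2 B/param) plus ~20% assumed adapter/optimizer overhead."""
    return params_billion * 1e9 * 2 * 1.2 / 2**30

print(round(full_ft_gib(8)))  # 119 -> consistent with 2-4 x 80GB for a 7-8B model
print(round(lora_gib(8)))     # 18
```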
Tip
Use LoRA for most fine-tuning tasks. It’s significantly more memory-efficient and often achieves comparable results to full fine-tuning.