Training Configuration#

Tip

Want to learn about training concepts at a high level? Check out the Customization concepts page.

Complete Schema Reference#

For a complete list of training parameters and their valid values with constraints and types:

CustomizationJobInput object
Input schema for creating customization jobs.
Properties
model * string
Model reference (e.g., 'workspace/model-name').
dataset * string
Dataset URI. Supported protocol: fileset:// (e.g., fileset://workspace/name).
training * object | object | object
Training method and hyperparameters.
Discriminator: property: type
One of:
Option 1: object - Supervised Fine-Tuning.
Option 2: object - Knowledge Distillation with a teacher model. Customizer's differentiator — not available in Unsloth. Trains the student model to match the teacher's output distribution.
Option 3: object - Direct Preference Optimization.
integrations object
Third-party integrations (e.g., Weights & Biases, MLflow).
Properties
wandb object
Weights & Biases integration configuration.
Properties
project string
W&B project name (groups related runs). Defaults to output.name if not set.
name string
W&B run name. Defaults to job_id if not provided.
entity string
W&B entity (team or username).
tags array
W&B tags for filtering runs.
Array items:
item string
notes string
W&B notes/description for the run.
base_url string
Base URL for self-hosted W&B server (e.g., 'https://wandb.mycompany.com'). If not provided, uses the default W&B cloud service.
api_key_secret string
Reference to a secret containing the WANDB_API_KEY. Format: 'secret_name' (uses request workspace) or 'workspace/secret_name' (explicit workspace).
Constraints: pattern: ^[a-z0-9_-]+(/[a-z0-9_-]+)?$
mlflow object
MLflow integration configuration.
Properties
experiment_name string
MLflow experiment name (groups related runs). Defaults to output.name if not set.
run_name string
MLflow run name. Defaults to job_id if not provided.
tags object
MLflow tags as key-value pairs for filtering runs.
Additional properties schema:
[key: string] string
description string
MLflow run description.
tracking_uri string
MLflow tracking server URI (e.g., 'http://mlflow.mycompany.com:5000'). Can also be set via MLFLOW_TRACKING_URI environment variable.
deployment_config string | object
Deployment configuration for auto-deploying the model after training. Pass a string to reference an existing ModelDeploymentConfig by name (e.g., 'my-config' or 'workspace/my-config'). An object provides inline NIM deployment parameters. Omit to skip deployment.
Any of:
Option 1: string - A reference to DeploymentParams.
Option 2: object - Inline deployment parameters for creating a new ModelDeploymentConfig.
custom_fields object
Custom user-defined fields.
Allows additional properties: Yes
output object
Output artifact configuration. If omitted, name is auto-generated as `{model}-{dataset}-`. The output type (model vs adapter) is always inferred from the training configuration.
Examples: {'name': 'my-finetuned-llama'}
Properties
name * string
Name of the output artifact. Used to identify it during deployment and inference.
Examples: my-finetuned-llama, llama-3-8b-lora-v2
Constraints: max length: 255, pattern: ^[\w\-.]+$
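As a quick sanity check, the `api_key_secret` pattern above can be validated client-side before submitting a job. A minimal sketch (the helper name is illustrative, not part of the API):

```python
import re

# Pattern copied from the api_key_secret constraint in the schema above.
SECRET_REF = re.compile(r"^[a-z0-9_-]+(/[a-z0-9_-]+)?$")

def is_valid_secret_ref(ref: str) -> bool:
    """Return True if ref is 'secret_name' or 'workspace/secret_name'."""
    return SECRET_REF.fullmatch(ref) is not None

print(is_valid_secret_ref("wandb-key"))               # True
print(is_valid_secret_ref("my-workspace/wandb-key"))  # True
print(is_valid_secret_ref("Team/Key"))                # False: uppercase is not allowed
```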

Training is configured in the job’s spec.training object.

Quick Reference#

The training field is a discriminated union on the type field. Each training method inherits common hyperparameters and adds method-specific fields.

Training Method#

| Parameter | Values | Description |
| --- | --- | --- |
| training.type | sft, dpo, distillation | Training method (discriminated union) |
| training.peft | { type: "lora", rank: 8, ... } or omit | PEFT adapter configuration. If set, trains an adapter; if omitted, performs full-weight training |
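Putting the schema together, a minimal SFT job body can be sketched as a Python dict (the model and dataset names below are placeholders, not real resources):

```python
# Minimal CustomizationJobInput payload sketch. Only model, dataset,
# and training are required; output.name is auto-generated if omitted.
job = {
    "model": "my-workspace/llama-3-8b",             # placeholder model reference
    "dataset": "fileset://my-workspace/chat-data",  # fileset:// is the supported protocol
    "training": {
        "type": "sft",                         # discriminator: sft | dpo | distillation
        "peft": {"type": "lora", "rank": 8},   # omit peft for full-weight training
        "epochs": 3,
    },
}
assert job["training"]["type"] in {"sft", "dpo", "distillation"}
```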

For the SFT training schema with types and constraints:

SFTTrainingInput object
Supervised Fine-Tuning.
Properties
peft object
PEFT adapter configuration. If set, trains a parameter-efficient adapter. If omitted, performs full-weight fine-tuning.
Properties
quantization object
Enable quantized training to reduce GPU memory. If the base model is full-precision, it will be quantized at load time. If the base model is already pre-quantized, this configures the expected precision. The trained adapter remains full-precision.
Properties
precision string
Quantization precision. '4bit' (NF4) for maximum memory savings, '8bit' (LLM.int8) for a balance of quality and memory.
Allowed values:
4bit, 8bit
Default: 4bit
type string
Default: lora
rank integer
LoRA rank (low-rank dimension). Higher values increase capacity but use more memory.
Constraints: minimum: 1.0, maximum: 256.0
Default: 8
alpha integer
LoRA alpha scaling factor. Common practice: alpha = 2-4x rank.
Constraints: minimum: 1.0
Default: 32
dropout number
LoRA dropout probability for regularization.
Constraints: minimum: 0.0, maximum: 1.0
Default: 0.0
target_modules array
Module name patterns to apply LoRA to (e.g., ['*.q_proj', '*.v_proj']). If not set, applies to all '*proj' linear layers.
Array items:
item string
merge boolean
Merge LoRA weights into base model after training. Produces a full-weight checkpoint instead of an adapter.
Default: False
use_dora boolean
Enable DoRA (Weight-Decomposed Low-Rank Adaptation). Decomposes weight updates into magnitude and direction components. Can improve quality especially at low ranks, but adds training overhead.
Default: False
learning_rate number
Peak learning rate. The optimal value depends on the training type and whether PEFT is used. For SFT without LoRA, start with 5e-5; with LoRA, start with 1e-4. Lowering the value gives slower, more precise training; raising it speeds up learning.
Default: 0.0001
min_learning_rate number
Minimum learning rate for cosine decay. Optional; used with learning rate schedules.
weight_decay number
Weight decay coefficient. Helps prevent overfitting.
Default: 0.01
adam_beta1 number
Adam beta1 parameter. Adjust for optimizer tuning.
Default: 0.9
adam_beta2 number
Adam beta2 parameter. Adjust for optimizer tuning.
Default: 0.999
warmup_steps integer
Linear warmup steps. Recommended: 10% of total training steps for stable training.
Constraints: minimum: 0.0
Default: 0
optimizer string
Optimizer name (e.g., 'adamw').
epochs integer
Number of complete passes through the dataset. The ideal number of epochs depends on the training method, the number of training samples, and the size of the model. Start with 3 as a reasonable value, then monitor the validation and training loss curves; if both are still decreasing, you can increase this number.
Constraints: exclusive min: 0.0
Default: 1
max_steps integer
Max training steps. Overrides epochs if set.
log_every_n_steps integer
Logging frequency in steps. Controls how often training metrics are logged.
val_check_interval number
Validation interval. Float <= 1.0 is fraction of epoch; > 1.0 is step count.
batch_size integer
Global batch size across all GPUs. Higher = faster but more memory. If OOM, reduce this first.
Constraints: exclusive min: 0.0
Default: 32
micro_batch_size integer
Per-GPU micro batch size. Keep small (1-2) for large models to avoid OOM.
Constraints: exclusive min: 0.0
Default: 1
sequence_packing boolean
Enable sequence packing for efficiency. Can improve training speed.
Default: False
max_seq_length integer
Maximum token sequence length for training. Higher = more memory, longer training.
Constraints: exclusive min: 0.0
Default: 2048
precision string
Model precision for training. Auto-detected if unset.
Allowed values:
fp8, bf16, fp16, fp32
seed integer
Random seed for reproducibility. Optional.
parallelism object
Distributed training parallelism configuration. Most users only need num_gpus_per_node. Advanced users can configure tensor/pipeline/context/expert parallelism for large models.
Properties
num_gpus_per_node integer
Number of GPUs per node.
Constraints: exclusive min: 0.0
Default: 1
num_nodes integer
Number of nodes.
Constraints: exclusive min: 0.0
Default: 1
tensor_parallel_size integer
Tensor parallel size.
Constraints: exclusive min: 0.0
Default: 1
pipeline_parallel_size integer
Pipeline parallel size.
Constraints: exclusive min: 0.0
Default: 1
context_parallel_size integer
Context parallel size.
Constraints: exclusive min: 0.0
Default: 1
expert_parallel_size integer
Expert parallel size (MoE models).
sequence_parallel boolean
Enable sequence parallelism.
Default: False
type string
Default: sft

DPO Configuration#

When training.type is "dpo", additional DPO-specific fields are available:

| Parameter | Description | Recommended Values |
| --- | --- | --- |
| ref_policy_kl_penalty | KL divergence penalty (beta) | 0.05-0.5 |
| preference_average_log_probs | Average log probabilities for preference loss | false |
| sft_average_log_probs | Average log probabilities for SFT loss | false |
| preference_loss_weight | Weight for preference loss | 1.0 |
| sft_loss_weight | Weight for SFT loss | 0.0 |
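To make these knobs concrete, here is a rough per-example sketch of the DPO objective as described in the DPO paper — a simplified illustration, not Customizer's actual implementation. `beta` corresponds to `ref_policy_kl_penalty`, and the averaging flag mirrors `preference_average_log_probs`:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.05, average_log_probs=False):
    """Per-example DPO loss from per-token log-probs (lists of floats)."""
    # Sum token log-probs, or average them when the flag is set.
    agg = (lambda lp: sum(lp) / len(lp)) if average_log_probs else sum
    margin = ((agg(policy_chosen) - agg(ref_chosen))
              - (agg(policy_rejected) - agg(ref_rejected)))
    # -log(sigmoid(beta * margin)) in a numerically stable form.
    return math.log1p(math.exp(-beta * margin))

# When the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2).
loss = dpo_loss([-1.0, -1.0], [-3.0, -3.0], [-2.0, -2.0], [-2.0, -2.0], beta=0.1)
print(loss < math.log(2))  # True
```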

For the DPO training schema with types and constraints:

DPOTrainingInput object
Direct Preference Optimization.
Properties
peft object
PEFT adapter configuration. If set, trains a parameter-efficient adapter. If omitted, performs full-weight fine-tuning.
Properties
quantization object
Enable quantized training to reduce GPU memory. If the base model is full-precision, it will be quantized at load time. If the base model is already pre-quantized, this configures the expected precision. The trained adapter remains full-precision.
Properties
precision string
Quantization precision. '4bit' (NF4) for maximum memory savings, '8bit' (LLM.int8) for a balance of quality and memory.
Allowed values:
4bit, 8bit
Default: 4bit
type string
Default: lora
rank integer
LoRA rank (low-rank dimension). Higher values increase capacity but use more memory.
Constraints: minimum: 1.0, maximum: 256.0
Default: 8
alpha integer
LoRA alpha scaling factor. Common practice: alpha = 2-4x rank.
Constraints: minimum: 1.0
Default: 32
dropout number
LoRA dropout probability for regularization.
Constraints: minimum: 0.0, maximum: 1.0
Default: 0.0
target_modules array
Module name patterns to apply LoRA to (e.g., ['*.q_proj', '*.v_proj']). If not set, applies to all '*proj' linear layers.
Array items:
item string
merge boolean
Merge LoRA weights into base model after training. Produces a full-weight checkpoint instead of an adapter.
Default: False
use_dora boolean
Enable DoRA (Weight-Decomposed Low-Rank Adaptation). Decomposes weight updates into magnitude and direction components. Can improve quality especially at low ranks, but adds training overhead.
Default: False
learning_rate number
Peak learning rate. The optimal value depends on the training type and whether PEFT is used. For SFT without LoRA, start with 5e-5; with LoRA, start with 1e-4. Lowering the value gives slower, more precise training; raising it speeds up learning.
Default: 0.0001
min_learning_rate number
Minimum learning rate for cosine decay. Optional; used with learning rate schedules.
weight_decay number
Weight decay coefficient. Helps prevent overfitting.
Default: 0.01
adam_beta1 number
Adam beta1 parameter. Adjust for optimizer tuning.
Default: 0.9
adam_beta2 number
Adam beta2 parameter. Adjust for optimizer tuning.
Default: 0.999
warmup_steps integer
Linear warmup steps. Recommended: 10% of total training steps for stable training.
Constraints: minimum: 0.0
Default: 0
optimizer string
Optimizer name (e.g., 'adamw').
epochs integer
Number of complete passes through the dataset. The ideal number of epochs depends on the training method, the number of training samples, and the size of the model. Start with 3 as a reasonable value, then monitor the validation and training loss curves; if both are still decreasing, you can increase this number.
Constraints: exclusive min: 0.0
Default: 1
max_steps integer
Max training steps. Overrides epochs if set.
log_every_n_steps integer
Logging frequency in steps. Controls how often training metrics are logged.
val_check_interval number
Validation interval. Float <= 1.0 is fraction of epoch; > 1.0 is step count.
batch_size integer
Global batch size across all GPUs. Higher = faster but more memory. If OOM, reduce this first.
Constraints: exclusive min: 0.0
Default: 32
micro_batch_size integer
Per-GPU micro batch size. Keep small (1-2) for large models to avoid OOM.
Constraints: exclusive min: 0.0
Default: 1
sequence_packing boolean
Enable sequence packing for efficiency. Can improve training speed.
Default: False
max_seq_length integer
Maximum token sequence length for training. Higher = more memory, longer training.
Constraints: exclusive min: 0.0
Default: 2048
precision string
Model precision for training. Auto-detected if unset.
Allowed values:
fp8, bf16, fp16, fp32
seed integer
Random seed for reproducibility. Optional.
parallelism object
Distributed training parallelism configuration. Most users only need num_gpus_per_node. Advanced users can configure tensor/pipeline/context/expert parallelism for large models.
Properties
num_gpus_per_node integer
Number of GPUs per node.
Constraints: exclusive min: 0.0
Default: 1
num_nodes integer
Number of nodes.
Constraints: exclusive min: 0.0
Default: 1
tensor_parallel_size integer
Tensor parallel size.
Constraints: exclusive min: 0.0
Default: 1
pipeline_parallel_size integer
Pipeline parallel size.
Constraints: exclusive min: 0.0
Default: 1
context_parallel_size integer
Context parallel size.
Constraints: exclusive min: 0.0
Default: 1
expert_parallel_size integer
Expert parallel size (MoE models).
sequence_parallel boolean
Enable sequence parallelism.
Default: False
type string
Default: dpo
ref_policy_kl_penalty number
KL penalty coefficient (beta in DPO paper).
Constraints: minimum: 0.0
Default: 0.05
preference_average_log_probs boolean
Average log probabilities for preference loss calculation.
Default: False
sft_average_log_probs boolean
Average log probabilities for SFT regularization loss.
Default: False
preference_loss_weight number
Weight for the preference (DPO) loss term.
Constraints: minimum: 0.0
Default: 1.0
sft_loss_weight number
Weight for SFT regularization loss (0 = disabled).
Constraints: minimum: 0.0
Default: 0.0
max_grad_norm number
Maximum gradient norm for clipping.
Constraints: minimum: 0.0
Default: 1.0

Note

PEFT (LoRA) is not yet supported with DPO training. Use full-weight training by omitting the peft field.

Tip

When setting val_check_interval for DPO, use a fractional value (e.g., 0.5 for twice per epoch) or omit it entirely (validates once at end of epoch). Avoid integer step counts — they may not divide evenly into the total training steps, which can prevent validation from running on the final step.
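The interpretation of `val_check_interval` can be sketched as follows — a simplified model of the behavior described above, not the trainer's actual scheduling code:

```python
def validation_points(val_check_interval, steps_per_epoch):
    """Return the step numbers (within one epoch) where validation runs."""
    if val_check_interval <= 1.0:  # fraction of an epoch
        stride = max(1, int(steps_per_epoch * val_check_interval))
    else:                          # absolute step count
        stride = int(val_check_interval)
    return [s for s in range(stride, steps_per_epoch + 1, stride)]

# 0.5 -> twice per epoch; an integer that doesn't divide the epoch skips the end.
print(validation_points(0.5, 100))  # [50, 100]
print(validation_points(30, 100))   # [30, 60, 90] -- never validates at step 100
```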

Parallelism Configuration#

Parallelism parameters are grouped inside training.parallelism:

| Parameter | Description | Notes |
| --- | --- | --- |
| parallelism.num_gpus_per_node | Number of GPUs per node | Default: 1 |
| parallelism.num_nodes | Number of training nodes | Use 1 unless multi-node setup |
| parallelism.tensor_parallel_size | GPUs for tensor parallelism | Split layers across GPUs (for large models) |
| parallelism.pipeline_parallel_size | GPUs for pipeline parallelism | Split model stages across GPUs |
| parallelism.context_parallel_size | GPUs for context parallelism | For very long sequences |
| parallelism.expert_parallel_size | Expert parallelism for MoE models | Must divide the number of experts |
| parallelism.sequence_parallel | Enable sequence parallelism | Memory optimization for long sequences |

Note

GPU Relationship: total_gpus = num_gpus_per_node × num_nodes

data_parallel_size is automatically derived as total_gpus / (TP × PP × CP).
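This relationship can be sketched as a quick divisibility check in Python (illustrative only; the parameter names mirror the parallelism fields above):

```python
def data_parallel_size(num_gpus_per_node=1, num_nodes=1,
                       tensor=1, pipeline=1, context=1):
    """Derive data_parallel_size = total_gpus / (TP * PP * CP)."""
    total_gpus = num_gpus_per_node * num_nodes
    model_parallel = tensor * pipeline * context
    if total_gpus % model_parallel:
        raise ValueError("TP x PP x CP must divide the total GPU count")
    return total_gpus // model_parallel

# 2 nodes x 8 GPUs with TP=4, PP=2 leaves a data-parallel size of 2.
print(data_parallel_size(8, 2, tensor=4, pipeline=2))  # 2
```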

PEFT / LoRA Configuration#

To train a LoRA adapter, set training.peft:

| Parameter | Description | Recommended Values |
| --- | --- | --- |
| peft.type | PEFT method type | "lora" (currently the only supported method) |
| peft.rank | LoRA rank (low-rank dimension) | 8-64. Higher = more capacity, more memory |
| peft.alpha | LoRA alpha scaling factor | 2-4× rank (e.g., 32 for rank 8) |
| peft.dropout | LoRA dropout probability | 0.0-0.1 for regularization |
| peft.target_modules | Module patterns to apply LoRA to | null = all linear layers (default) |
| peft.merge | Merge LoRA weights into base model | false (default). If true, produces a full-weight checkpoint |
| peft.use_dora | Enable DoRA (Weight-Decomposed Low-Rank Adaptation) | false (default) |

"training": {
    "type": "sft",
    "peft": {
        "type": "lora",
        "rank": 8,
        "alpha": 32,
        "dropout": 0.0,
        "target_modules": ["*.q_proj", "*.v_proj"]  # Optional: specific modules
    }
}
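For intuition on the rank/memory trade-off: a LoRA adapter on one d_out × d_in linear layer adds rank × (d_in + d_out) trainable weights. A rough estimate (an illustrative helper — real counts depend on which modules `target_modules` matches):

```python
def lora_params(layer_shapes, rank=8):
    """Trainable parameters added by LoRA over the given (d_out, d_in) layers."""
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)

# e.g. q_proj and v_proj of a 4096-wide model across 32 layers
shapes = [(4096, 4096)] * 2 * 32
print(lora_params(shapes, rank=8))   # 4194304 -- a few million trainable params
print(lora_params(shapes, rank=64))  # 8x more at rank 64
```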

Distillation Configuration#

Note

Knowledge distillation is not yet supported.


Common Tuning Scenarios#

Loss Not Decreasing (Underfitting)#

"training": {
    "type": "sft",
    "peft": {"type": "lora"},
    "epochs": 5,  # Increase from 3
    "learning_rate": 0.0001,  # Increase from 5e-5
    "warmup_steps": 50  # Add warmup
}

Out of Memory (OOM) Errors#

"training": {
    "type": "sft",
    "peft": {"type": "lora"},
    "batch_size": 8,  # Reduce from 32
    "micro_batch_size": 1,  # Reduce from 2
    "max_seq_length": 1024  # Reduce from 2048
}
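The global batch is assembled by accumulating gradients over per-GPU micro-batches. A sketch of how the three knobs relate (assuming a standard accumulation scheme implied by the field descriptions; the trainer's exact internals may differ):

```python
def grad_accum_steps(batch_size, micro_batch_size, data_parallel_size=1):
    """Micro-batches each GPU accumulates before one optimizer step."""
    per_step = micro_batch_size * data_parallel_size
    if batch_size % per_step:
        raise ValueError("batch_size must be divisible by micro_batch_size x DP")
    return batch_size // per_step

# The defaults (32 global, 1 per GPU) on 4 data-parallel GPUs:
print(grad_accum_steps(32, 1, data_parallel_size=4))  # 8
```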

Overfitting (Validation Loss Increasing)#

"training": {
    "type": "sft",
    "peft": {
        "type": "lora",
        "rank": 8,
        "dropout": 0.1  # Add dropout
    },
    "epochs": 2,  # Reduce from 5
    "learning_rate": 0.00002,  # Lower to 2e-5
    "weight_decay": 0.01
}

GPU Memory Guidelines#

Estimated GPU requirements by model size:

| Model Size | LoRA | Full Fine-Tuning (min GPUs) |
| --- | --- | --- |
| 1B | 16GB | 1 × 24GB |
| 3B | 24GB | 2 × 24GB |
| 7-8B | 40GB | 2-4 × 80GB |
| 13B | 80GB | 4 × 80GB |
| 70B | 2 × 80GB | 8+ × 80GB |

Tip

Use LoRA for most fine-tuning tasks. It’s significantly more memory-efficient and often achieves comparable results to full fine-tuning.