Training Configuration#

Tip

Want to learn about training concepts at a high level? Check out the Customization concepts page.

Complete Schema Reference#

For a complete list of training parameters and their valid values with constraints and types:

CustomizationJobInput object
Input schema for creating customization jobs.
Properties
model * string
Model reference (e.g., 'workspace/model-name').
dataset * string
Dataset URI. Supported protocol: fileset:// (e.g., fileset://workspace/name).
training * object | object | object
Training method and hyperparameters.
Discriminator: property: type
One of:
Option 1: object - Supervised Fine-Tuning.
Option 2: object - Knowledge Distillation with a teacher model. Customizer's differentiator — not available in Unsloth. Trains the student model to match the teacher's output distribution.
Option 3: object - Direct Preference Optimization.
integrations object
Third-party integrations (e.g., Weights & Biases, MLflow).
Properties
wandb object
Weights & Biases integration configuration.
Properties
project string
W&B project name (groups related runs). Defaults to output.name if not set.
name string
W&B run name. Defaults to job_id if not provided.
entity string
W&B entity (team or username).
tags array
W&B tags for filtering runs.
Array items:
item string
notes string
W&B notes/description for the run.
base_url string
Base URL for self-hosted W&B server (e.g., 'https://wandb.mycompany.com'). If not provided, uses the default W&B cloud service.
api_key_secret string
Reference to a secret containing the WANDB_API_KEY. Format: 'secret_name' (uses request workspace) or 'workspace/secret_name' (explicit workspace).
Constraints: pattern: ^[a-z0-9_-]+(/[a-z0-9_-]+)?$
mlflow object
MLflow integration configuration.
Properties
experiment_name string
MLflow experiment name (groups related runs). Defaults to output.name if not set.
run_name string
MLflow run name. Defaults to job_id if not provided.
tags object
MLflow tags as key-value pairs for filtering runs.
Additional properties schema:
[key: string] string
description string
MLflow run description.
tracking_uri string
MLflow tracking server URI (e.g., 'http://mlflow.mycompany.com:5000'). Can also be set via MLFLOW_TRACKING_URI environment variable.
deployment_config string | object
Deployment configuration for auto-deploying the model after training. Pass a string to reference an existing ModelDeploymentConfig by name (e.g., 'my-config' or 'workspace/my-config'). An object provides inline NIM deployment parameters. Omit to skip deployment.
Any of:
Option 1: string - A reference to DeploymentParams.
Option 2: object - Inline deployment parameters for creating a new ModelDeploymentConfig.
custom_fields object
Custom user-defined fields.
Allows additional properties: Yes
output object
Output artifact configuration. If omitted, name is auto-generated as `{model}-{dataset}-`. The output type (model vs adapter) is always inferred from the training configuration.
Examples: {'name': 'my-finetuned-llama'}
Properties
name * string
Name of the output artifact. Used to identify it during deployment and inference.
Examples: my-finetuned-llama, llama-3-8b-lora-v2
Constraints: max length: 255, pattern: ^[\w\-.]+$
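As a quick sanity check, the `api_key_secret` pattern above can be validated client-side before submitting a job. A minimal sketch (the helper name is illustrative, not part of the API):

```python
import re

# Pattern copied from the api_key_secret constraint in the schema above.
SECRET_REF = re.compile(r"^[a-z0-9_-]+(/[a-z0-9_-]+)?$")

def is_valid_secret_ref(ref: str) -> bool:
    """Return True if ref is 'secret_name' or 'workspace/secret_name'."""
    return SECRET_REF.fullmatch(ref) is not None

print(is_valid_secret_ref("wandb-key"))               # True
print(is_valid_secret_ref("my-workspace/wandb-key"))  # True
print(is_valid_secret_ref("Team/Key"))                # False: uppercase is not allowed
```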

Training is configured in the job’s spec.training object.

Quick Reference#

The training field is a discriminated union on the type field. Each training method inherits common hyperparameters and adds method-specific fields.

Training Method#

| Parameter | Values | Description |
| --- | --- | --- |
| training.type | sft, dpo, distillation | Training method (discriminated union) |
| training.peft | { type: "lora", rank: 8, ... } or omit | PEFT adapter configuration. If set, trains an adapter; if omitted, performs full-weight training |
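Putting the schema together, a minimal SFT job body can be sketched as a Python dict (the model and dataset names below are placeholders, not real resources):

```python
# Minimal CustomizationJobInput payload sketch. Only model, dataset,
# and training are required; output.name is auto-generated if omitted.
job = {
    "model": "my-workspace/llama-3-8b",             # placeholder model reference
    "dataset": "fileset://my-workspace/chat-data",  # fileset:// is the supported protocol
    "training": {
        "type": "sft",                         # discriminator: sft | dpo | distillation
        "peft": {"type": "lora", "rank": 8},   # omit peft for full-weight training
        "epochs": 3,
    },
}
assert job["training"]["type"] in {"sft", "dpo", "distillation"}
```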

For the SFT training schema with types and constraints:

SFTTrainingInput object
Supervised Fine-Tuning.
Properties
peft object
PEFT adapter configuration. If set, trains a parameter-efficient adapter. If omitted, performs full-weight fine-tuning.
Properties
quantization object
Enable quantized training to reduce GPU memory. If the base model is full-precision, it will be quantized at load time. If the base model is already pre-quantized, this configures the expected precision. The trained adapter remains full-precision.
Properties
precision string
Quantization precision. '4bit' (NF4) for maximum memory savings, '8bit' (LLM.int8) for a balance of quality and memory.
Allowed values:
4bit, 8bit
Default: 4bit
type string
Default: lora
rank integer
LoRA rank (low-rank dimension). Higher values increase capacity but use more memory.
Constraints: minimum: 1.0, maximum: 256.0
Default: 8
alpha integer
LoRA alpha scaling factor. Common practice: alpha = 2-4x rank.
Constraints: minimum: 1.0
Default: 32
dropout number
LoRA dropout probability for regularization.
Constraints: minimum: 0.0, maximum: 1.0
Default: 0.0
target_modules array
Module name patterns to apply LoRA to (e.g., ['*.q_proj', '*.v_proj']). If not set, applies to all '*proj' linear layers.
Array items:
item string
merge boolean
Merge LoRA weights into base model after training. Produces a full-weight checkpoint instead of an adapter.
Default: False
use_dora boolean
Enable DoRA (Weight-Decomposed Low-Rank Adaptation). Decomposes weight updates into magnitude and direction components. Can improve quality especially at low ranks, but adds training overhead.
Default: False
learning_rate number
Peak learning rate. The optimal value depends on the training type and whether PEFT is used. For SFT without LoRA, start with 5e-5; with LoRA, start with 1e-4. Lowering the value gives slower, more precise training; raising it speeds up learning.
Default: 0.0001
min_learning_rate number
Minimum learning rate for cosine decay. Optional; used with learning rate schedules.
weight_decay number
Weight decay coefficient. Helps prevent overfitting.
Default: 0.01
adam_beta1 number
Adam beta1 parameter. Adjust for optimizer tuning.
Default: 0.9
adam_beta2 number
Adam beta2 parameter. Adjust for optimizer tuning.
Default: 0.999
warmup_steps integer
Linear warmup steps. Recommended: 10% of total training steps for stable training.
Constraints: minimum: 0.0
Default: 0
optimizer string
Optimizer name (e.g., 'adamw').
epochs integer
Number of complete passes through the dataset. The ideal number of epochs depends on the training method, the number of training samples, and the size of the model. Start with 3 as a reasonable value, then monitor the validation and training loss curves; if both are still decreasing, you can increase this number.
Constraints: exclusive min: 0.0
Default: 1
max_steps integer
Max training steps. Overrides epochs if set.
log_every_n_steps integer
Logging frequency in steps. Controls how often training metrics are logged.
val_check_interval number
Validation interval. Float <= 1.0 is fraction of epoch; > 1.0 is step count.
batch_size integer
Global batch size across all GPUs. Higher = faster but more memory. If OOM, reduce this first.
Constraints: exclusive min: 0.0
Default: 32
micro_batch_size integer
Per-GPU micro batch size. Keep small (1-2) for large models to avoid OOM.
Constraints: exclusive min: 0.0
Default: 1
sequence_packing boolean
Enable sequence packing for efficiency. Can improve training speed.
Default: False
max_seq_length integer
Maximum token sequence length for training. Higher = more memory, longer training.
Constraints: exclusive min: 0.0
Default: 2048
precision string
Model precision for training. Auto-detected if unset.
Allowed values:
fp8, bf16, fp16, fp32
seed integer
Random seed for reproducibility. Optional.
parallelism object
Distributed training parallelism configuration. Most users only need num_gpus_per_node. Advanced users can configure tensor/pipeline/context/expert parallelism for large models.
Properties
num_gpus_per_node integer
Number of GPUs per node.
Constraints: exclusive min: 0.0
Default: 1
num_nodes integer
Number of nodes.
Constraints: exclusive min: 0.0
Default: 1
tensor_parallel_size integer
Tensor parallel size.
Constraints: exclusive min: 0.0
Default: 1
pipeline_parallel_size integer
Pipeline parallel size.
Constraints: exclusive min: 0.0
Default: 1
context_parallel_size integer
Context parallel size.
Constraints: exclusive min: 0.0
Default: 1
expert_parallel_size integer
Expert parallel size (MoE models).
sequence_parallel boolean
Enable sequence parallelism.
Default: False
type string
Default: sft

DPO Configuration#

When training.type is "dpo", additional DPO-specific fields are available:

| Parameter | Description | Recommended Values |
| --- | --- | --- |
| ref_policy_kl_penalty | KL divergence penalty (beta) | 0.05-0.5 |
| preference_average_log_probs | Average log probabilities for preference loss | false |
| sft_average_log_probs | Average log probabilities for SFT loss | false |
| preference_loss_weight | Weight for preference loss | 1.0 |
| sft_loss_weight | Weight for SFT loss | 0.0 |
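To make these knobs concrete, here is a rough per-example sketch of the DPO objective as described in the DPO paper — a simplified illustration, not Customizer's actual implementation. `beta` corresponds to `ref_policy_kl_penalty`, and the averaging flag mirrors `preference_average_log_probs`:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.05, average_log_probs=False):
    """Per-example DPO loss from per-token log-probs (lists of floats)."""
    # Sum token log-probs, or average them when the flag is set.
    agg = (lambda lp: sum(lp) / len(lp)) if average_log_probs else sum
    margin = ((agg(policy_chosen) - agg(ref_chosen))
              - (agg(policy_rejected) - agg(ref_rejected)))
    # -log(sigmoid(beta * margin)) in a numerically stable form.
    return math.log1p(math.exp(-beta * margin))

# When the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2).
loss = dpo_loss([-1.0, -1.0], [-3.0, -3.0], [-2.0, -2.0], [-2.0, -2.0], beta=0.1)
print(loss < math.log(2))  # True
```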

For the DPO training schema with types and constraints:

DPOTrainingInput object
Direct Preference Optimization.
Properties
peft object
PEFT adapter configuration. If set, trains a parameter-efficient adapter. If omitted, performs full-weight fine-tuning.
Properties
quantization object
Enable quantized training to reduce GPU memory. If the base model is full-precision, it will be quantized at load time. If the base model is already pre-quantized, this configures the expected precision. The trained adapter remains full-precision.
Properties
precision string
Quantization precision. '4bit' (NF4) for maximum memory savings, '8bit' (LLM.int8) for a balance of quality and memory.
Allowed values:
4bit, 8bit
Default: 4bit
type string
Default: lora
rank integer
LoRA rank (low-rank dimension). Higher values increase capacity but use more memory.
Constraints: minimum: 1.0, maximum: 256.0
Default: 8
alpha integer
LoRA alpha scaling factor. Common practice: alpha = 2-4x rank.
Constraints: minimum: 1.0
Default: 32
dropout number
LoRA dropout probability for regularization.
Constraints: minimum: 0.0, maximum: 1.0
Default: 0.0
target_modules array
Module name patterns to apply LoRA to (e.g., ['*.q_proj', '*.v_proj']). If not set, applies to all '*proj' linear layers.
Array items:
item string
merge boolean
Merge LoRA weights into base model after training. Produces a full-weight checkpoint instead of an adapter.
Default: False
use_dora boolean
Enable DoRA (Weight-Decomposed Low-Rank Adaptation). Decomposes weight updates into magnitude and direction components. Can improve quality especially at low ranks, but adds training overhead.
Default: False
learning_rate number
Peak learning rate. The optimal value depends on the training type and whether PEFT is used. For SFT without LoRA, start with 5e-5; with LoRA, start with 1e-4. Lowering the value gives slower, more precise training; raising it speeds up learning.
Default: 0.0001
min_learning_rate number
Minimum learning rate for cosine decay. Optional; used with learning rate schedules.
weight_decay number
Weight decay coefficient. Helps prevent overfitting.
Default: 0.01
adam_beta1 number
Adam beta1 parameter. Adjust for optimizer tuning.
Default: 0.9
adam_beta2 number
Adam beta2 parameter. Adjust for optimizer tuning.
Default: 0.999
warmup_steps integer
Linear warmup steps. Recommended: 10% of total training steps for stable training.
Constraints: minimum: 0.0
Default: 0
optimizer string
Optimizer name (e.g., 'adamw').
epochs integer
Number of complete passes through the dataset. The ideal number of epochs depends on the training method, the number of training samples, and the size of the model. Start with 3 as a reasonable value, then monitor the validation and training loss curves; if both are still decreasing, you can increase this number.
Constraints: exclusive min: 0.0
Default: 1
max_steps integer
Max training steps. Overrides epochs if set.
log_every_n_steps integer
Logging frequency in steps. Controls how often training metrics are logged.
val_check_interval number
Validation interval. Float <= 1.0 is fraction of epoch; > 1.0 is step count.
batch_size integer
Global batch size across all GPUs. Higher = faster but more memory. If OOM, reduce this first.
Constraints: exclusive min: 0.0
Default: 32
micro_batch_size integer
Per-GPU micro batch size. Keep small (1-2) for large models to avoid OOM.
Constraints: exclusive min: 0.0
Default: 1
sequence_packing boolean
Enable sequence packing for efficiency. Can improve training speed.
Default: False
max_seq_length integer
Maximum token sequence length for training. Higher = more memory, longer training.
Constraints: exclusive min: 0.0
Default: 2048
precision string
Model precision for training. Auto-detected if unset.
Allowed values:
fp8, bf16, fp16, fp32
seed integer
Random seed for reproducibility. Optional.
parallelism object
Distributed training parallelism configuration. Most users only need num_gpus_per_node. Advanced users can configure tensor/pipeline/context/expert parallelism for large models.
Properties
num_gpus_per_node integer
Number of GPUs per node.
Constraints: exclusive min: 0.0
Default: 1
num_nodes integer
Number of nodes.
Constraints: exclusive min: 0.0
Default: 1
tensor_parallel_size integer
Tensor parallel size.
Constraints: exclusive min: 0.0
Default: 1
pipeline_parallel_size integer
Pipeline parallel size.
Constraints: exclusive min: 0.0
Default: 1
context_parallel_size integer
Context parallel size.
Constraints: exclusive min: 0.0
Default: 1
expert_parallel_size integer
Expert parallel size (MoE models).
sequence_parallel boolean
Enable sequence parallelism.
Default: False
type string
Default: dpo
ref_policy_kl_penalty number
KL penalty coefficient (beta in DPO paper).
Constraints: minimum: 0.0
Default: 0.05
preference_average_log_probs boolean
Average log probabilities for preference loss calculation.
Default: False
sft_average_log_probs boolean
Average log probabilities for SFT regularization loss.
Default: False
preference_loss_weight number
Weight for the preference (DPO) loss term.
Constraints: minimum: 0.0
Default: 1.0
sft_loss_weight number
Weight for SFT regularization loss (0 = disabled).
Constraints: minimum: 0.0
Default: 0.0
max_grad_norm number
Maximum gradient norm for clipping.
Constraints: minimum: 0.0
Default: 1.0

Note

PEFT (LoRA) is not yet supported with DPO training. Use full-weight training by omitting the peft field.

Tip

When setting val_check_interval for DPO, use a fractional value (e.g., 0.5 for twice per epoch) or omit it entirely (validates once at end of epoch). Avoid integer step counts — they may not divide evenly into the total training steps, which can prevent validation from running on the final step.
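The interpretation of `val_check_interval` can be sketched as follows — a simplified model of the behavior described above, not the trainer's actual scheduling code:

```python
def validation_points(val_check_interval, steps_per_epoch):
    """Return the step numbers (within one epoch) where validation runs."""
    if val_check_interval <= 1.0:  # fraction of an epoch
        stride = max(1, int(steps_per_epoch * val_check_interval))
    else:                          # absolute step count
        stride = int(val_check_interval)
    return [s for s in range(stride, steps_per_epoch + 1, stride)]

# 0.5 -> twice per epoch; an integer that doesn't divide the epoch skips the end.
print(validation_points(0.5, 100))  # [50, 100]
print(validation_points(30, 100))   # [30, 60, 90] -- never validates at step 100
```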

Parallelism Configuration#

Parallelism parameters are grouped inside training.parallelism:

| Parameter | Description | Notes |
| --- | --- | --- |
| parallelism.num_gpus_per_node | Number of GPUs per node | Default: 1 |
| parallelism.num_nodes | Number of training nodes | Use 1 unless multi-node setup |
| parallelism.tensor_parallel_size | GPUs for tensor parallelism | Split layers across GPUs (for large models) |
| parallelism.pipeline_parallel_size | GPUs for pipeline parallelism | Split model stages across GPUs |
| parallelism.context_parallel_size | GPUs for context parallelism | For very long sequences |
| parallelism.expert_parallel_size | Expert parallelism for MoE models | Must divide the number of experts |
| parallelism.sequence_parallel | Enable sequence parallelism | Memory optimization for long sequences |

Note

GPU Relationship: total_gpus = num_gpus_per_node × num_nodes

data_parallel_size is automatically derived as total_gpus / (TP × PP × CP).
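This relationship can be sketched as a quick divisibility check in Python (illustrative only; the parameter names mirror the parallelism fields above):

```python
def data_parallel_size(num_gpus_per_node=1, num_nodes=1,
                       tensor=1, pipeline=1, context=1):
    """Derive data_parallel_size = total_gpus / (TP * PP * CP)."""
    total_gpus = num_gpus_per_node * num_nodes
    model_parallel = tensor * pipeline * context
    if total_gpus % model_parallel:
        raise ValueError("TP x PP x CP must divide the total GPU count")
    return total_gpus // model_parallel

# 2 nodes x 8 GPUs with TP=4, PP=2 leaves a data-parallel size of 2.
print(data_parallel_size(8, 2, tensor=4, pipeline=2))  # 2
```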

PEFT / LoRA Configuration#

To train a LoRA adapter, set training.peft:

| Parameter | Description | Recommended Values |
| --- | --- | --- |
| peft.type | PEFT method type | "lora" (currently the only supported method) |
| peft.rank | LoRA rank (low-rank dimension) | 8-64. Higher = more capacity, more memory |
| peft.alpha | LoRA alpha scaling factor | 2-4× rank (e.g., 32 for rank 8) |
| peft.dropout | LoRA dropout probability | 0.0-0.1 for regularization |
| peft.target_modules | Module patterns to apply LoRA to | null = all linear layers (default) |
| peft.merge | Merge LoRA weights into base model | false (default). If true, produces a full-weight checkpoint |
| peft.use_dora | Enable DoRA (Weight-Decomposed Low-Rank Adaptation) | false (default) |

"training": {
    "type": "sft",
    "peft": {
        "type": "lora",
        "rank": 8,
        "alpha": 32,
        "dropout": 0.0,
        "target_modules": ["*.q_proj", "*.v_proj"]  # Optional: specific modules
    }
}
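For intuition on the rank/memory trade-off: a LoRA adapter on one d_out × d_in linear layer adds rank × (d_in + d_out) trainable weights. A rough estimate (an illustrative helper — real counts depend on which modules `target_modules` matches):

```python
def lora_params(layer_shapes, rank=8):
    """Trainable parameters added by LoRA over the given (d_out, d_in) layers."""
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)

# e.g. q_proj and v_proj of a 4096-wide model across 32 layers
shapes = [(4096, 4096)] * 2 * 32
print(lora_params(shapes, rank=8))   # 4194304 -- a few million trainable params
print(lora_params(shapes, rank=64))  # 8x more at rank 64
```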

Distillation Configuration#

Note

Knowledge distillation is not yet supported.


Common Tuning Scenarios#

Loss Not Decreasing (Underfitting)#

"training": {
    "type": "sft",
    "peft": {"type": "lora"},
    "epochs": 5,  # Increase from 3
    "learning_rate": 0.0001,  # Increase from 5e-5
    "warmup_steps": 50  # Add warmup
}

Out of Memory (OOM) Errors#

"training": {
    "type": "sft",
    "peft": {"type": "lora"},
    "batch_size": 8,  # Reduce from 32
    "micro_batch_size": 1,  # Reduce from 2
    "max_seq_length": 1024  # Reduce from 2048
}
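The global batch is assembled by accumulating gradients over per-GPU micro-batches. A sketch of how the three knobs relate (assuming a standard accumulation scheme implied by the field descriptions; the trainer's exact internals may differ):

```python
def grad_accum_steps(batch_size, micro_batch_size, data_parallel_size=1):
    """Micro-batches each GPU accumulates before one optimizer step."""
    per_step = micro_batch_size * data_parallel_size
    if batch_size % per_step:
        raise ValueError("batch_size must be divisible by micro_batch_size x DP")
    return batch_size // per_step

# The defaults (32 global, 1 per GPU) on 4 data-parallel GPUs:
print(grad_accum_steps(32, 1, data_parallel_size=4))  # 8
```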

Overfitting (Validation Loss Increasing)#

"training": {
    "type": "sft",
    "peft": {
        "type": "lora",
        "rank": 8,
        "dropout": 0.1  # Add dropout
    },
    "epochs": 2,  # Reduce from 5
    "learning_rate": 0.00002,  # Lower to 2e-5
    "weight_decay": 0.01
}

GPU Memory Guidelines#

Estimated GPU requirements by model size:

| Model Size | LoRA | Full Fine-Tuning (min GPUs) |
| --- | --- | --- |
| 1B | 16GB | 1 × 24GB |
| 3B | 24GB | 2 × 24GB |
| 7-8B | 40GB | 2-4 × 80GB |
| 13B | 80GB | 4 × 80GB |
| 70B | 2 × 80GB | 8+ × 80GB |

Tip

Use LoRA for most fine-tuning tasks. It’s significantly more memory-efficient and often achieves comparable results to full fine-tuning.