# Training Configuration

<a id="ft-hyperparameters" />

Want to learn about training concepts at a high level? Check out the [Customization concepts](/documentation/customizer-reference/customization-concepts) page.

NeMo Customizer ships **two training backends**, and each accepts its own job configuration. Choose the backend that matches your hardware and training goal, then configure the hyperparameters from that backend's schema below.

| Backend                 | Best for                                                 | Training methods                                     | Hardware                                                             |
| ----------------------- | -------------------------------------------------------- | ---------------------------------------------------- | -------------------------------------------------------------------- |
| **Automodel** (default) | Production fine-tuning, larger models, multi-GPU scaling | SFT, distillation; LoRA, merged-LoRA, or full-weight | Single- or multi-GPU (tensor / pipeline / context / expert parallel) |
| **Unsloth**             | Memory-constrained single-GPU LoRA                       | SFT; LoRA or full-weight                             | Single GPU (4-bit / 8-bit quantization)                              |

The two backends do **not** share field names. For example, Automodel uses `batch.global_batch_size` / `batch.micro_batch_size` and a `parallelism` block; Unsloth uses `batch.per_device_train_batch_size` / `batch.gradient_accumulation_steps` and a `hardware` block. Both schemas reject unknown keys, so a field from one backend will not validate against the other.
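
To make the difference concrete, here is a minimal sketch of the two job shapes written as Python dictionaries. The field names come from the tables on this page; the model and fileset names are hypothetical placeholders, and how a job is submitted (SDK call or REST request) is deliberately not shown.

```python
# Illustrative job configs. Field names follow the tables on this page;
# every value (including model and fileset names) is a placeholder.
automodel_job = {
    "model": "my-workspace/my-base-model",  # hypothetical Model Entity
    "dataset": {"training": "my-workspace/my-train-fileset"},  # hypothetical fileset
    "batch": {"global_batch_size": 8, "micro_batch_size": 1},
    "parallelism": {"num_nodes": 1, "num_gpus_per_node": 1},
}

unsloth_job = {
    "model": {"name": "my-workspace/my-base-model", "load_in_4bit": True},
    "dataset": {"path": "my-workspace/my-train-fileset"},
    "batch": {"per_device_train_batch_size": 1, "gradient_accumulation_steps": 8},
    "hardware": {"precision": "bf16"},
}

# The two batch blocks have no field names in common.
shared_batch_keys = set(automodel_job["batch"]) & set(unsloth_job["batch"])
print(shared_batch_keys)  # set()
```

Note that even `model` differs in shape: a plain reference for Automodel, a block with `name` for Unsloth.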

Each backend can also print its live schema, and the generated REST shapes are in the [Customizer API Reference](/documentation/reference/api-reference) (search for `AutomodelJobInput` and `UnslothJobInput`).

***

## Automodel Configuration

An Automodel job is configured with the following top-level sections: `model`, `dataset`, `training`, `schedule`, `batch`, `optimizer`, `parallelism`, `output`, and (optionally) `integrations`.

### Model and Dataset

| Field                     | Description                                                               | Default      |
| ------------------------- | ------------------------------------------------------------------------- | ------------ |
| `model`                   | Base **Model Entity** reference (`name` or `workspace/name`) to fine-tune | *(required)* |
| `dataset.training`        | Training fileset reference (`name` or `workspace/name`)                   | *(required)* |
| `dataset.validation`      | Optional validation fileset reference                                     | `null`       |
| `dataset.prompt_template` | Optional prompt template for custom dataset schemas                       | `null`       |

### Training Method

| Parameter                  | Values                                   | Description                                                                                                                          | Default       |
| -------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ------------- |
| `training.training_type`   | `sft`, `distillation`                    | Training method                                                                                                                      | `sft`         |
| `training.finetuning_type` | `lora`, `lora_merged`, `all_weights`     | Training scope. `lora` trains an adapter; `lora_merged` merges it into the base weights; `all_weights` performs full-weight training | `lora`        |
| `training.lora`            | `{ rank, alpha, merge, target_modules }` | LoRA configuration (auto-filled with defaults when `finetuning_type` is a LoRA variant)                                              | *(see below)* |
| `training.max_seq_length`  | integer                                  | Maximum sequence length                                                                                                              | `2048`        |

LoRA parameters (`training.lora`):

| Parameter        | Description                                                                | Default |
| ---------------- | -------------------------------------------------------------------------- | ------- |
| `rank`           | LoRA rank (low-rank dimension). Higher = more capacity and memory          | `16`    |
| `alpha`          | LoRA scaling factor                                                        | `32`    |
| `merge`          | Merge the adapter into the base model at the end of training               | `false` |
| `target_modules` | List of module patterns to adapt; `null` applies LoRA to all linear layers | `null`  |
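
The following sketch illustrates why a higher `rank` costs more memory. It relies on the standard LoRA construction, in which each adapted linear layer gains two low-rank matrices, and it uses a hypothetical 4096 × 4096 layer. It is not a description of how Automodel allocates memory internally.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    Standard LoRA adds A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical layer size, for illustration only
full_params = d_in * d_out

for rank in (8, 16, 64):
    added = lora_param_count(d_in, d_out, rank)
    # Adapter size grows linearly with rank.
    print(f"rank={rank}: {added} adapter params vs {full_params} in the full layer")
```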

### Schedule

| Parameter                     | Description                                                                                                                       | Default |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `schedule.epochs`             | Number of passes over the training data                                                                                           | `1`     |
| `schedule.max_steps`          | Optional cap on training steps. When set, training stops at this many steps even if `epochs` is not reached                       | `null`  |
| `schedule.val_check_interval` | Validation cadence. Use a fractional value (e.g. `0.5` for twice per epoch); avoid integer step counts that may not divide evenly | `null`  |
| `schedule.seed`               | Random seed                                                                                                                       | `null`  |
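
The interaction between `epochs` and `max_steps` can be sketched as follows. The helper name is hypothetical, and it assumes one optimizer step consumes `batch.global_batch_size` samples and that a final partial batch still counts as a step; the trainer's exact rounding may differ.

```python
import math


def planned_steps(num_samples, global_batch_size, epochs=1, max_steps=None):
    """Estimate how many optimizer steps a run will take (illustrative only)."""
    # Assumption: a trailing partial batch still produces one step.
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    total = epochs * steps_per_epoch
    # max_steps, when set, caps the run even if the epochs are not finished.
    return total if max_steps is None else min(total, max_steps)


print(planned_steps(1000, 8, epochs=3))                 # 375
print(planned_steps(1000, 8, epochs=3, max_steps=100))  # 100
```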

### Batch

| Parameter                 | Description                                               | Default |
| ------------------------- | --------------------------------------------------------- | ------- |
| `batch.global_batch_size` | Effective batch size across all data-parallel ranks       | `8`     |
| `batch.micro_batch_size`  | Per-step batch size on each device                        | `1`     |
| `batch.sequence_packing`  | Pack multiple samples into one sequence to reduce padding | `false` |

### Optimizer

| Parameter                 | Description                                  | Default |
| ------------------------- | -------------------------------------------- | ------- |
| `optimizer.learning_rate` | Step size for weight updates                 | `5e-6`  |
| `optimizer.weight_decay`  | L2 regularization strength                   | `0.01`  |
| `optimizer.warmup_steps`  | Linear warmup steps before the main schedule | `0`     |

### Parallelism

The `parallelism` block scales Automodel training across GPUs and nodes.

| Parameter                            | Description                                                              | Default |
| ------------------------------------ | ------------------------------------------------------------------------ | ------- |
| `parallelism.num_nodes`              | Number of training nodes                                                 | `1`     |
| `parallelism.num_gpus_per_node`      | GPUs per node                                                            | `1`     |
| `parallelism.tensor_parallel_size`   | GPUs for tensor parallelism (splits layers across GPUs for large models) | `1`     |
| `parallelism.pipeline_parallel_size` | GPUs for pipeline parallelism (splits model stages across GPUs)          | `1`     |
| `parallelism.context_parallel_size`  | GPUs for context parallelism (for very long sequences)                   | `1`     |
| `parallelism.expert_parallel_size`   | Expert parallelism for MoE models; must divide the number of experts     | `null`  |

**GPU relationships and constraints:**

* `total_gpus = num_gpus_per_node × num_nodes`.
* `total_gpus` must be divisible by `tensor_parallel_size × pipeline_parallel_size × context_parallel_size` (abbreviated TP × PP × CP).
* `data_parallel_size` is derived as `total_gpus / (TP × PP × CP)`, and `global_batch_size` must be divisible by `micro_batch_size × data_parallel_size`.
* For MoE models, tensor parallelism must be `1` when `expert_parallel_size > 1`.
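
The constraints above can be checked before submitting a job. The sketch below is a hypothetical client-side pre-flight helper, not the service's own validator; it uses the documented defaults for missing fields. The rule that `expert_parallel_size` must divide the number of experts is not checked because it depends on the model.

```python
def check_parallelism(parallelism: dict, batch: dict) -> int:
    """Return the derived data_parallel_size, or raise ValueError."""
    total_gpus = parallelism.get("num_gpus_per_node", 1) * parallelism.get("num_nodes", 1)
    tp = parallelism.get("tensor_parallel_size", 1)
    pp = parallelism.get("pipeline_parallel_size", 1)
    cp = parallelism.get("context_parallel_size", 1)
    ep = parallelism.get("expert_parallel_size")  # None unless set for MoE models

    model_parallel = tp * pp * cp
    if total_gpus % model_parallel != 0:
        raise ValueError(f"total_gpus={total_gpus} is not divisible by TP*PP*CP={model_parallel}")
    dp = total_gpus // model_parallel

    gbs = batch.get("global_batch_size", 8)
    mbs = batch.get("micro_batch_size", 1)
    if gbs % (mbs * dp) != 0:
        raise ValueError(f"global_batch_size={gbs} is not divisible by micro_batch_size*DP={mbs * dp}")

    if ep is not None and ep > 1 and tp != 1:
        raise ValueError("tensor_parallel_size must be 1 when expert_parallel_size > 1")
    return dp


dp = check_parallelism(
    {"num_nodes": 2, "num_gpus_per_node": 8, "tensor_parallel_size": 2, "pipeline_parallel_size": 2},
    {"global_batch_size": 32, "micro_batch_size": 2},
)
print(dp)  # 4
```

In this example, 16 GPUs split into TP × PP × CP = 4 leave four data-parallel replicas, and 32 is divisible by 2 × 4.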

### Distillation

When `training.training_type` is `"distillation"`, the following additional fields configure knowledge distillation from a teacher model:

| Parameter                           | Description                                                                                                  | Default      |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------ | ------------ |
| `training.teacher_model`            | Teacher Model Entity reference. Required for distillation. Must share the student's tokenizer and vocabulary | *(required)* |
| `training.teacher_precision`        | Precision for loading the frozen teacher (`bf16`, `fp16`, `fp32`)                                            | `bf16`       |
| `training.distillation_ratio`       | Balance between cross-entropy loss and KD loss. `0.0` = CE only, `1.0` = KD only                             | `0.5`        |
| `training.distillation_temperature` | Softmax temperature for KD. Higher = softer distributions                                                    | `1.0`        |
| `training.offload_teacher`          | Offload the teacher model to save GPU memory                                                                 | `false`      |

<a id="kd-constraints" />

* Knowledge distillation uses **logit-pair distillation** — the student learns to match the teacher's output probability distribution.
* Both student and teacher must be **full-weight Model Entities** and **share the same tokenizer and vocabulary**. Use models from the same family (e.g. Qwen3 1.7B + Qwen3 4B).
* Both models are loaded during training; plan GPU memory accordingly (or set `offload_teacher`).
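
To show how `distillation_ratio` and `distillation_temperature` interact, here is a minimal single-token sketch in plain Python. It assumes the KD term is the KL divergence from the temperature-softened teacher distribution to the student distribution and that the two terms are mixed linearly. The exact formulation used by the backend (for example, any temperature-squared scaling) is not stated on this page, so treat this as an illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, ratio=0.5, temperature=1.0):
    """Blend cross-entropy and KD loss for one token (assumed formulation)."""
    # Cross-entropy against the ground-truth label, at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # KD term: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # ratio = 0.0 gives CE only, ratio = 1.0 gives KD only.
    return (1.0 - ratio) * ce + ratio * kd


student = [2.0, 0.5, -1.0]  # hypothetical logits
teacher = [3.0, 0.0, -2.0]  # hypothetical logits
print(distillation_loss(student, teacher, label=0, ratio=0.5, temperature=2.0))
```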

***

## Unsloth Configuration

An Unsloth job is configured with the following top-level sections: `model`, `dataset`, `training`, `schedule`, `batch`, `optimizer`, `hardware`, `output`, and (optionally) `integrations`. Unsloth runs on a **single GPU** and supports 4-bit / 8-bit quantized loading.

### Model

| Parameter                 | Description                                                                         | Default      |
| ------------------------- | ----------------------------------------------------------------------------------- | ------------ |
| `model.name`              | Base Model Entity reference (`name` or `workspace/name`)                            | *(required)* |
| `model.max_seq_length`    | Maximum sequence length                                                             | `2048`       |
| `model.load_in_4bit`      | Load the base model in 4-bit (bitsandbytes). Mutually exclusive with `load_in_8bit` | `true`       |
| `model.load_in_8bit`      | Load the base model in 8-bit                                                        | `false`      |
| `model.dtype`             | Compute dtype (`auto`, `bfloat16`, `float16`, `float32`)                            | `auto`       |
| `model.trust_remote_code` | Allow custom model code from the checkpoint                                         | `false`      |

Full-weight training (`training.finetuning_type: "all_weights"`) cannot be combined with quantized loading. Set `load_in_4bit` and `load_in_8bit` to `false` for full-weight runs.

### Dataset

| Parameter                     | Description                                                               | Default      |
| ----------------------------- | ------------------------------------------------------------------------- | ------------ |
| `dataset.path`                | Training fileset reference (`name` or `workspace/name`)                   | *(required)* |
| `dataset.text_field`          | Row field consumed by the trainer                                         | `text`       |
| `dataset.apply_chat_template` | Apply the tokenizer's chat template to rows containing a `messages` field | `false`      |
| `dataset.validation_path`     | Optional validation fileset reference                                     | `null`       |
| `dataset.packing`             | Pack multiple samples into one sequence to reduce padding                 | `false`      |

### Training Method

| Parameter                             | Values                                                                     | Description                                                                           | Default       |
| ------------------------------------- | -------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ------------- |
| `training.training_type`              | `sft`                                                                      | Training method                                                                       | `sft`         |
| `training.finetuning_type`            | `lora`, `all_weights`                                                      | Training scope. `lora` trains an adapter; `all_weights` performs full-weight training | `lora`        |
| `training.lora`                       | `{ rank, alpha, dropout, target_modules, bias, use_rslora, random_state }` | LoRA configuration (auto-filled with defaults when `finetuning_type` is `lora`)       | *(see below)* |
| `training.use_gradient_checkpointing` | `unsloth`, `true`, `false`                                                 | Gradient checkpointing mode. `unsloth` uses Unsloth's optimized implementation        | `unsloth`     |

LoRA parameters (`training.lora`):

| Parameter        | Description                                     | Default                                                                                             |
| ---------------- | ----------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| `rank`           | LoRA rank                                       | `16`                                                                                                |
| `alpha`          | LoRA scaling factor                             | `16`                                                                                                |
| `dropout`        | LoRA dropout probability                        | `0.0`                                                                                               |
| `target_modules` | Modules to adapt                                | Unsloth's 7-module set: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| `bias`           | Bias training mode (`none`, `all`, `lora_only`) | `none`                                                                                              |
| `use_rslora`     | Use rank-stabilized LoRA                        | `false`                                                                                             |
| `random_state`   | LoRA initialization seed                        | `3407`                                                                                              |
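
The sketch below shows how `rank`, `alpha`, and `use_rslora` combine into the factor that scales the adapter update. It follows the standard definitions (`alpha / rank` for LoRA, `alpha / sqrt(rank)` for rank-stabilized LoRA) and assumes the backend applies them unchanged.

```python
import math


def lora_scale(rank: int, alpha: float, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A (standard definitions)."""
    if use_rslora:
        return alpha / math.sqrt(rank)  # rank-stabilized LoRA
    return alpha / rank                 # classic LoRA


print(lora_scale(16, 16))                   # 1.0 (Unsloth defaults)
print(lora_scale(16, 16, use_rslora=True))  # 4.0
print(lora_scale(16, 32))                   # 2.0 (Automodel defaults: rank 16, alpha 32)
```

The defaults differ between backends, so carrying `rank` and `alpha` over from one to the other changes the effective update scale unless both are set explicitly.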

### Schedule

| Parameter                    | Description                                                                    | Default  |
| ---------------------------- | ------------------------------------------------------------------------------ | -------- |
| `schedule.epochs`            | Number of passes over the training data                                        | `1`      |
| `schedule.max_steps`         | Optional cap on training steps (overrides `epochs` when set)                   | `null`   |
| `schedule.warmup_steps`      | Linear warmup steps. Mutually exclusive with `warmup_ratio`                    | `0`      |
| `schedule.warmup_ratio`      | Warmup as a fraction of total steps                                            | `null`   |
| `schedule.lr_scheduler_type` | `linear`, `cosine`, `constant`, `constant_with_warmup`, `cosine_with_restarts` | `linear` |
| `schedule.logging_steps`     | Logging cadence (steps)                                                        | `1`      |
| `schedule.save_steps`        | Checkpoint cadence (steps)                                                     | `null`   |
| `schedule.eval_steps`        | Evaluation cadence (steps)                                                     | `null`   |
| `schedule.seed`              | Random seed                                                                    | `3407`   |
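
A sketch of how the two warmup settings might resolve to a step count. The helper is hypothetical, and rounding the ratio up to a whole step is an assumption (it mirrors common trainer behavior, but is not documented here).

```python
import math


def resolve_warmup_steps(total_steps, warmup_steps=0, warmup_ratio=None):
    """Return the number of warmup steps implied by the schedule settings."""
    if warmup_steps and warmup_ratio is not None:
        raise ValueError("set warmup_steps or warmup_ratio, not both")
    if warmup_ratio is not None:
        # Assumption: the fraction of total steps is rounded up.
        return math.ceil(total_steps * warmup_ratio)
    return warmup_steps


print(resolve_warmup_steps(400, warmup_ratio=0.25))  # 100
print(resolve_warmup_steps(400, warmup_steps=20))    # 20
```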

### Batch

| Parameter                           | Description                                | Default |
| ----------------------------------- | ------------------------------------------ | ------- |
| `batch.per_device_train_batch_size` | Per-step batch size on the GPU             | `1`     |
| `batch.gradient_accumulation_steps` | Steps to accumulate before a weight update | `1`     |

### Optimizer

| Parameter                 | Description                                                                                                                             | Default      |
| ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | ------------ |
| `optimizer.learning_rate` | Step size for weight updates                                                                                                            | `2e-4`       |
| `optimizer.weight_decay`  | L2 regularization strength                                                                                                              | `0.0`        |
| `optimizer.optim`         | Optimizer (`adamw_torch`, `adamw_torch_fused`, `adamw_8bit`, `paged_adamw_8bit`, `sgd`). 8-bit optimizers reduce optimizer-state memory | `adamw_8bit` |

### Hardware

| Parameter            | Description                                                                                        | Default |
| -------------------- | -------------------------------------------------------------------------------------------------- | ------- |
| `hardware.gpus`      | Comma-separated GPU indices (`0` or `0,1`) for `CUDA_VISIBLE_DEVICES` (selection, not reservation) | `null`  |
| `hardware.precision` | Mixed-precision dtype (`bf16`, `fp16`). `bf16` recommended for Ampere+                             | `bf16`  |

### Output (save method)

Unsloth's output `save_method` controls the saved checkpoint shape:

| `save_method`  | Result                                                         |
| -------------- | -------------------------------------------------------------- |
| `lora`         | Saves the LoRA adapter (default)                               |
| `merged_16bit` | Merges the adapter into the base and saves a 16-bit checkpoint |
| `merged_4bit`  | Merges the adapter into the base and saves a 4-bit checkpoint  |

The `merged_*` methods are only valid when `training.finetuning_type` is `lora`.
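
The rule above can be expressed as a small pre-flight check. The helper is hypothetical and encodes only the constraint stated on this page; it makes no claim about which `save_method` values are accepted for full-weight runs.

```python
VALID_SAVE_METHODS = {"lora", "merged_16bit", "merged_4bit"}


def check_save_method(save_method="lora", finetuning_type="lora"):
    """Raise ValueError for combinations this page rules out."""
    if save_method not in VALID_SAVE_METHODS:
        raise ValueError(f"unknown save_method: {save_method!r}")
    # Merging needs an adapter, so merged_* requires LoRA training.
    if save_method.startswith("merged_") and finetuning_type != "lora":
        raise ValueError(f"{save_method} is only valid when finetuning_type is 'lora'")


check_save_method("merged_16bit", "lora")  # passes
```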

***

## GPU Memory Guidelines

Estimated GPU requirements by model size:

| Model Size | LoRA (min GPU memory) | Full fine-tuning (min GPUs) |
| ---------- | --------------------- | --------------------------- |
| 1B         | 16 GB                 | 1 × 24 GB                   |
| 3B         | 24 GB                 | 2 × 24 GB                   |
| 7-8B       | 40 GB                 | 2-4 × 80 GB                 |
| 13B        | 80 GB                 | 4 × 80 GB                   |
| 70B        | 2 × 80 GB             | 8+ × 80 GB                  |

Use LoRA for most fine-tuning tasks: it is significantly more memory-efficient than full fine-tuning and often achieves comparable results. On a single memory-constrained GPU, the Unsloth backend with 4-bit loading fits the largest base models.
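
As a back-of-the-envelope illustration of why quantized loading helps, the sketch below estimates the memory taken by the base model weights alone. It ignores activations, gradients, optimizer state, adapter weights, and quantization metadata, so real requirements (as in the table above) are considerably higher.

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    # 1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB.
    return num_params_billion * BYTES_PER_PARAM[dtype]


print(weight_memory_gb(7, "bf16"))  # 14.0
print(weight_memory_gb(7, "4bit"))  # 3.5
```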