Training Configuration
Want to learn about training concepts at a high level? Check out the Customization concepts page.
NeMo Customizer ships two training backends, and each accepts its own job configuration. Choose the backend that matches your hardware and training goal, then configure the hyperparameters from that backend’s schema below.
The two backends do not share field names. For example, Automodel uses batch.global_batch_size / batch.micro_batch_size and a parallelism block; Unsloth uses batch.per_device_train_batch_size / batch.gradient_accumulation_steps and a hardware block. Both schemas reject unknown keys, so a field from one backend will not validate against the other.
Each backend can also print its live schema, and the generated REST shapes are in the Customizer API Reference (search for AutomodelJobInput and UnslothJobInput).
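To make the field-name mismatch concrete, here is a minimal sketch of how a strict schema would treat a `batch` block written for the wrong backend. Only the field names come from this page. The numeric values, the `unknown_keys` helper, and the two allow-lists are illustrative assumptions; the real schemas may define more `batch` fields than the ones named here.

```python
# Field names as listed on this page; the real schemas may define more.
AUTOMODEL_BATCH_FIELDS = {"global_batch_size", "micro_batch_size"}
UNSLOTH_BATCH_FIELDS = {"per_device_train_batch_size", "gradient_accumulation_steps"}

# Hypothetical values, chosen only for illustration.
automodel_batch = {"global_batch_size": 32, "micro_batch_size": 2}
unsloth_batch = {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 16}


def unknown_keys(block: dict, allowed: set) -> set:
    """Return the keys a strict (reject-unknown-keys) schema would refuse."""
    return set(block) - allowed


# An Unsloth-style batch block does not validate as an Automodel batch block.
rejected = unknown_keys(unsloth_batch, AUTOMODEL_BATCH_FIELDS)
print(sorted(rejected))
```

Running the same check on `automodel_batch` against `AUTOMODEL_BATCH_FIELDS` returns an empty set, which is what a valid block looks like in this sketch.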
Automodel Configuration
An Automodel job is configured with the following top-level sections: model, dataset, training, schedule, batch, optimizer, parallelism, output, and (optionally) integrations.
Model and Dataset
Training Method
LoRA parameters (training.lora):
Schedule
Batch
Optimizer
Parallelism
The parallelism block scales Automodel training across GPUs and nodes.
GPU relationships and constraints:
- total_gpus = num_gpus_per_node × num_nodes.
- total_gpus must be divisible by tensor_parallel_size × pipeline_parallel_size × context_parallel_size.
- data_parallel_size is derived as total_gpus / (TP × PP × CP), and global_batch_size must be divisible by micro_batch_size × data_parallel_size.
- For MoE models, tensor parallelism must be 1 when expert_parallel_size > 1.
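The constraints above can be checked locally before submitting a job. The following is a minimal sketch, not the validator Customizer itself runs: the function name `check_parallelism`, its defaults of 1 for each parallel size, and the error messages are assumptions made for illustration.

```python
def check_parallelism(
    *,
    num_gpus_per_node: int,
    num_nodes: int,
    global_batch_size: int,
    micro_batch_size: int,
    tensor_parallel_size: int = 1,    # default of 1 is an assumption
    pipeline_parallel_size: int = 1,  # default of 1 is an assumption
    context_parallel_size: int = 1,   # default of 1 is an assumption
    expert_parallel_size: int = 1,    # default of 1 is an assumption
) -> int:
    """Validate the documented constraints and return data_parallel_size."""
    total_gpus = num_gpus_per_node * num_nodes
    model_parallel = (
        tensor_parallel_size * pipeline_parallel_size * context_parallel_size
    )
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"total_gpus={total_gpus} is not divisible by TP x PP x CP={model_parallel}"
        )

    data_parallel_size = total_gpus // model_parallel
    step = micro_batch_size * data_parallel_size
    if global_batch_size % step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"micro_batch_size x data_parallel_size={step}"
        )

    # MoE rule: expert parallelism requires tensor parallelism of 1.
    if expert_parallel_size > 1 and tensor_parallel_size != 1:
        raise ValueError("tensor_parallel_size must be 1 when expert_parallel_size > 1")

    return data_parallel_size


# 2 nodes x 8 GPUs with TP=2 and PP=2 leaves 16 / 4 = 4 data-parallel replicas.
dp = check_parallelism(
    num_gpus_per_node=8,
    num_nodes=2,
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    global_batch_size=32,
    micro_batch_size=2,
)
print(dp)
```

In this example a global_batch_size of 30 would be rejected, because micro_batch_size × data_parallel_size is 8 and 30 is not a multiple of 8.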
Distillation
When training.training_type is "distillation", the following additional fields configure knowledge distillation from a teacher model:
- Knowledge distillation uses logit-pair distillation — the student learns to match the teacher’s output probability distribution.
- Both student and teacher must be full-weight Model Entities and share the same tokenizer and vocabulary. Use models from the same family (e.g. Qwen3 1.7B + Qwen3 4B).
- Both models are loaded during training; plan GPU memory accordingly (or set offload_teacher).
Unsloth Configuration
An Unsloth job is configured with the following top-level sections: model, dataset, training, schedule, batch, optimizer, hardware, output, and (optionally) integrations. Unsloth runs on a single GPU and supports 4-bit / 8-bit quantized loading.
Model
Full-weight training (training.finetuning_type: "all_weights") cannot be combined with quantized loading. Set load_in_4bit and load_in_8bit to false for full-weight runs.
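The rule above is easy to check before submitting a job. This is a minimal sketch under stated assumptions: `check_quantized_loading` is a hypothetical helper, not part of Customizer, and it takes the three documented fields as plain arguments rather than assuming where they sit in the job payload.

```python
def check_quantized_loading(
    finetuning_type: str,
    load_in_4bit: bool = False,
    load_in_8bit: bool = False,
) -> None:
    """Raise if full-weight training is combined with quantized loading."""
    if finetuning_type == "all_weights" and (load_in_4bit or load_in_8bit):
        raise ValueError(
            "finetuning_type 'all_weights' requires load_in_4bit and "
            "load_in_8bit to both be false"
        )


# Full-weight run with quantization disabled: passes silently.
check_quantized_loading("all_weights", load_in_4bit=False, load_in_8bit=False)

# LoRA run with 4-bit loading: also allowed by this rule.
check_quantized_loading("lora", load_in_4bit=True)
```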
Dataset
Training Method
LoRA parameters (training.lora):
Schedule
Batch
Optimizer
Hardware
Output (save method)
Unsloth’s output save_method controls the saved checkpoint shape:
The merged_* methods are only valid when training.finetuning_type is lora.
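A pre-submission check for this restriction might look like the sketch below. The helper name `check_save_method` is hypothetical, and `"merged_example"` is a placeholder rather than a real save_method value; the only thing taken from this page is that methods starting with `merged_` require `training.finetuning_type` to be `lora`.

```python
def check_save_method(save_method: str, finetuning_type: str) -> None:
    """Raise if a merged_* save method is used without LoRA fine-tuning."""
    if save_method.startswith("merged_") and finetuning_type != "lora":
        raise ValueError(
            f"save_method {save_method!r} is only valid when "
            "training.finetuning_type is 'lora'"
        )


# "merged_example" is a placeholder name, not a documented value.
check_save_method("merged_example", "lora")  # allowed

try:
    check_save_method("merged_example", "all_weights")
except ValueError as err:
    print(f"rejected: {err}")
```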
GPU Memory Guidelines
Estimated GPU requirements by model size:
Use LoRA for most fine-tuning tasks — it is significantly more memory-efficient and often achieves results comparable to full fine-tuning. On a single memory-constrained GPU, the Unsloth backend with 4-bit loading fits the largest adapters.
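To see why 4-bit loading helps on a small GPU, a back-of-the-envelope calculation of the memory taken by the base model weights alone can be useful. This sketch is not Customizer's estimator: the function name and the parameter count are illustrative, and the result ignores activations, gradients, optimizer state, adapter weights, and any quantization metadata, so real requirements are higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Illustrative parameter count; substitute the size of your own model.
num_params = 8e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(num_params, bits):.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts this weights-only figure by a factor of four, which is the headroom that lets a quantized base model plus a LoRA adapter fit where a 16-bit copy would not.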