Training Configuration

Want to learn about training concepts at a high level? Check out the Customization concepts page.

NeMo Customizer ships two training backends, and each accepts its own job configuration. Choose the backend that matches your hardware and training goal, then configure the hyperparameters from that backend’s schema below.

| Backend | Best for | Training methods | Hardware |
| --- | --- | --- | --- |
| Automodel (default) | Production fine-tuning, larger models, multi-GPU scaling | SFT, distillation; LoRA, merged-LoRA, or full-weight | Single- or multi-GPU (tensor / pipeline / context / expert parallel) |
| Unsloth | Memory-constrained single-GPU LoRA | SFT; LoRA or full-weight | Single GPU (4-bit / 8-bit quantization) |

The two backends do not share field names. For example, Automodel uses batch.global_batch_size / batch.micro_batch_size and a parallelism block; Unsloth uses batch.per_device_train_batch_size / batch.gradient_accumulation_steps and a hardware block. Both schemas reject unknown keys, so a field from one backend will not validate against the other.
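The sketch below illustrates why a batch block written for one backend fails against the other. The key sets are copied from the tables on this page; `check_batch_block` is a hypothetical helper that only mimics the reject-unknown-keys behavior described above and is not part of NeMo Customizer.

```python
# Batch keys as documented on this page for each backend.
BATCH_KEYS = {
    "automodel": {"global_batch_size", "micro_batch_size", "sequence_packing"},
    "unsloth": {"per_device_train_batch_size", "gradient_accumulation_steps"},
}


def check_batch_block(backend: str, batch: dict) -> list:
    """Return the keys in `batch` that the given backend does not define.

    Hypothetical helper: it imitates strict schema validation and is not
    the service's real validator.
    """
    return sorted(set(batch) - BATCH_KEYS[backend])


automodel_batch = {"global_batch_size": 8, "micro_batch_size": 1}
unsloth_batch = {"per_device_train_batch_size": 1, "gradient_accumulation_steps": 8}

# An Automodel batch block has no unknown keys for Automodel.
print(check_batch_block("automodel", automodel_batch))
# An Unsloth batch block is entirely unknown to Automodel.
print(check_batch_block("automodel", unsloth_batch))
```

In practice the service performs this validation itself; a local check like this is only useful for catching a mixed-up block before submitting a job.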

Each backend can also print its live schema, and the generated REST shapes are in the Customizer API Reference (search for AutomodelJobInput and UnslothJobInput).


Automodel Configuration

An Automodel job is configured with the following top-level sections: model, dataset, training, schedule, batch, optimizer, parallelism, output, and (optionally) integrations.

Model and Dataset

| Field | Description | Default |
| --- | --- | --- |
| model | Base Model Entity reference (name or workspace/name) to fine-tune | (required) |
| dataset.training | Training fileset reference (name or workspace/name) | (required) |
| dataset.validation | Optional validation fileset reference | null |
| dataset.prompt_template | Optional prompt template for custom dataset schemas | null |

Training Method

| Parameter | Values | Description | Default |
| --- | --- | --- | --- |
| training.training_type | sft, distillation | Training method | sft |
| training.finetuning_type | lora, lora_merged, all_weights | Adapter regime. lora trains an adapter; lora_merged merges it into the base weights; all_weights performs full-weight training | lora |
| training.lora | { rank, alpha, merge, target_modules } | LoRA configuration (auto-filled with defaults when finetuning_type is a LoRA variant) | (see below) |
| training.max_seq_length | integer | Maximum sequence length | 2048 |

LoRA parameters (training.lora):

| Parameter | Description | Default |
| --- | --- | --- |
| rank | LoRA rank (low-rank dimension). Higher = more capacity and memory | 16 |
| alpha | LoRA scaling factor | 32 |
| merge | Merge the adapter into the base model at the end of training | false |
| target_modules | List of module patterns to adapt; null applies LoRA to all linear layers | null |

Schedule

| Parameter | Description | Default |
| --- | --- | --- |
| schedule.epochs | Number of passes over the training data | 1 |
| schedule.max_steps | Optional cap on training steps. When set, training stops at this many steps even if epochs is not reached | null |
| schedule.val_check_interval | Validation cadence. Use a fractional value (e.g. 0.5 for twice per epoch); avoid integer step counts that may not divide evenly | null |
| schedule.seed | Random seed | null |
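To see how epochs and max_steps interact, the following sketch estimates the number of optimizer steps for a run. It assumes one optimizer step per global batch and that a partial final batch still counts as a step; neither detail is stated on this page, and `planned_steps` is a hypothetical helper.

```python
import math


def planned_steps(num_samples, global_batch_size, epochs=1, max_steps=None):
    """Estimate (steps_per_epoch, total_steps) for a training run.

    Assumptions (not confirmed by the schema): one optimizer step per
    global batch, and a partial final batch is rounded up to a full step.
    """
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    total_steps = steps_per_epoch * epochs
    if max_steps is not None:
        # max_steps caps training even if the epochs have not completed.
        total_steps = min(total_steps, max_steps)
    return steps_per_epoch, total_steps


# 1000 samples at global batch size 8 is 125 steps per epoch;
# 3 epochs would be 375 steps, but max_steps=200 stops the run early.
print(planned_steps(1000, 8, epochs=3, max_steps=200))  # (125, 200)
```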

Batch

| Parameter | Description | Default |
| --- | --- | --- |
| batch.global_batch_size | Effective batch size across all data-parallel ranks | 8 |
| batch.micro_batch_size | Per-step batch size on each device | 1 |
| batch.sequence_packing | Pack multiple samples into one sequence to reduce padding | false |

Optimizer

| Parameter | Description | Default |
| --- | --- | --- |
| optimizer.learning_rate | Step size for weight updates | 5e-6 |
| optimizer.weight_decay | L2 regularization strength | 0.01 |
| optimizer.warmup_steps | Linear warmup steps before the main schedule | 0 |
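Putting the sections so far together, an Automodel LoRA job could be assembled roughly as follows. This is a sketch built only from the fields and defaults listed above: the model and fileset names are placeholders, the parallelism and output sections are omitted, and the exact request envelope should be taken from AutomodelJobInput in the API reference rather than from this example.

```python
import json

# Placeholder entity and fileset names; replace them with real references.
automodel_job = {
    "model": "my-workspace/my-base-model",
    "dataset": {
        "training": "my-workspace/my-train-fileset",
        "validation": None,
        "prompt_template": None,
    },
    "training": {
        "training_type": "sft",
        "finetuning_type": "lora",
        "lora": {"rank": 16, "alpha": 32, "merge": False, "target_modules": None},
        "max_seq_length": 2048,
    },
    "schedule": {"epochs": 1, "max_steps": None, "val_check_interval": None, "seed": None},
    "batch": {"global_batch_size": 8, "micro_batch_size": 1, "sequence_packing": False},
    "optimizer": {"learning_rate": 5e-6, "weight_decay": 0.01, "warmup_steps": 0},
}

# Serialize to JSON to inspect the payload shape (None becomes null).
print(json.dumps(automodel_job, indent=2))
```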

Parallelism

The parallelism block scales Automodel training across GPUs and nodes.

| Parameter | Description | Default |
| --- | --- | --- |
| parallelism.num_nodes | Number of training nodes | 1 |
| parallelism.num_gpus_per_node | GPUs per node | 1 |
| parallelism.tensor_parallel_size | GPUs for tensor parallelism (splits layers across GPUs for large models) | 1 |
| parallelism.pipeline_parallel_size | GPUs for pipeline parallelism (splits model stages across GPUs) | 1 |
| parallelism.context_parallel_size | GPUs for context parallelism (for very long sequences) | 1 |
| parallelism.expert_parallel_size | Expert parallelism for MoE models; must divide the number of experts | null |

GPU relationships and constraints:

  • total_gpus = num_gpus_per_node × num_nodes.
  • total_gpus must be divisible by tensor_parallel_size × pipeline_parallel_size × context_parallel_size.
  • data_parallel_size is derived as total_gpus / (tensor_parallel_size × pipeline_parallel_size × context_parallel_size), and global_batch_size must be divisible by micro_batch_size × data_parallel_size.
  • For MoE models, tensor parallelism must be 1 when expert_parallel_size > 1.
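The constraints above can be checked before submitting a job. The sketch below is a hypothetical pre-flight helper, not part of NeMo Customizer; the `micro_batches_per_step` value it returns assumes that the remaining batch factor is realized as gradient accumulation, which this page does not state.

```python
def check_parallelism(num_nodes=1, num_gpus_per_node=1, tp=1, pp=1, cp=1,
                      ep=None, global_batch_size=8, micro_batch_size=1):
    """Apply the documented GPU and batch constraints; return derived sizes."""
    total_gpus = num_gpus_per_node * num_nodes
    model_parallel = tp * pp * cp
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"total_gpus={total_gpus} is not divisible by TP*PP*CP={model_parallel}"
        )
    data_parallel_size = total_gpus // model_parallel
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"micro_batch_size*data_parallel_size={per_step}"
        )
    if ep is not None and ep > 1 and tp != 1:
        raise ValueError("tensor parallelism must be 1 when expert_parallel_size > 1")
    return {
        "total_gpus": total_gpus,
        "data_parallel_size": data_parallel_size,
        # Assumption: the leftover factor is handled by gradient accumulation.
        "micro_batches_per_step": global_batch_size // per_step,
    }


# One node with 8 GPUs and tensor parallelism of 2 leaves 4 data-parallel ranks.
print(check_parallelism(num_gpus_per_node=8, tp=2, global_batch_size=8))
```

For example, 8 GPUs with tensor_parallel_size of 3 fails the first check, because 8 is not divisible by 3.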

Distillation

When training.training_type is "distillation", the following additional fields configure knowledge distillation from a teacher model:

| Parameter | Description | Default |
| --- | --- | --- |
| training.teacher_model | Teacher Model Entity reference. Required for distillation. Must share the student’s tokenizer and vocabulary | (required) |
| training.teacher_precision | Precision for loading the frozen teacher (bf16, fp16, fp32) | bf16 |
| training.distillation_ratio | Balance between cross-entropy loss and KD loss. 0.0 = CE only, 1.0 = KD only | 0.5 |
| training.distillation_temperature | Softmax temperature for KD. Higher = softer distributions | 1.0 |
| training.offload_teacher | Offload the teacher model to save GPU memory | false |

  • Knowledge distillation uses logit-pair distillation: the student learns to match the teacher’s output probability distribution.
  • Both student and teacher must be full-weight Model Entities and share the same tokenizer and vocabulary. Use models from the same family (e.g. Qwen3 1.7B + Qwen3 4B).
  • Both models are loaded during training; plan GPU memory accordingly (or set offload_teacher).
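To make distillation_ratio and distillation_temperature concrete, here is a minimal single-token sketch of a blended loss. It assumes the KD term is the KL divergence between temperature-softened teacher and student distributions and that the two terms are mixed linearly; the exact formulation used by the backend (for example, whether the KD term is rescaled by the squared temperature) is not specified on this page.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      ratio=0.5, temperature=1.0):
    """Blend cross-entropy and KD loss for one token (illustrative only)."""
    # Cross-entropy against the ground-truth label, at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Assumed KD term: KL(teacher || student) on softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # ratio = 0.0 keeps only CE; ratio = 1.0 keeps only KD.
    return (1.0 - ratio) * ce + ratio * kd


student = [2.0, 0.5, -1.0]
teacher = [1.5, 0.2, -0.5]
print(distillation_loss(student, teacher, label=0, ratio=0.5, temperature=2.0))
```

With ratio set to 1.0 and identical student and teacher logits, the loss in this sketch is zero, which matches the intuition that the student already reproduces the teacher's distribution.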

Unsloth Configuration

An Unsloth job is configured with the following top-level sections: model, dataset, training, schedule, batch, optimizer, hardware, output, and (optionally) integrations. Unsloth runs on a single GPU and supports 4-bit / 8-bit quantized loading.

Model

| Parameter | Description | Default |
| --- | --- | --- |
| model.name | Base Model Entity reference (name or workspace/name) | (required) |
| model.max_seq_length | Maximum sequence length | 2048 |
| model.load_in_4bit | Load the base model in 4-bit (bitsandbytes). Mutually exclusive with load_in_8bit | true |
| model.load_in_8bit | Load the base model in 8-bit | false |
| model.dtype | Compute dtype (auto, bfloat16, float16, float32) | auto |
| model.trust_remote_code | Allow custom model code from the checkpoint | false |

Full-weight training (training.finetuning_type: "all_weights") cannot be combined with quantized loading. Set load_in_4bit and load_in_8bit to false for full-weight runs.
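Because load_in_4bit defaults to true, a full-weight run that only changes finetuning_type conflicts with the default loading mode. The sketch below encodes the two loading rules stated above; `check_unsloth_loading` is a hypothetical helper, not the service's validator.

```python
def check_unsloth_loading(finetuning_type="lora", load_in_4bit=True, load_in_8bit=False):
    """Return a list of problems with the quantized-loading settings."""
    errors = []
    if load_in_4bit and load_in_8bit:
        errors.append("load_in_4bit and load_in_8bit are mutually exclusive")
    if finetuning_type == "all_weights" and (load_in_4bit or load_in_8bit):
        errors.append("all_weights requires load_in_4bit and load_in_8bit to be false")
    return errors


# Defaults are fine for LoRA.
print(check_unsloth_loading())
# Switching to full-weight training without disabling 4-bit loading is flagged.
print(check_unsloth_loading(finetuning_type="all_weights"))
# Disabling both quantized modes resolves it.
print(check_unsloth_loading("all_weights", load_in_4bit=False, load_in_8bit=False))
```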

Dataset

| Parameter | Description | Default |
| --- | --- | --- |
| dataset.path | Training fileset reference (name or workspace/name) | (required) |
| dataset.text_field | Row field consumed by the trainer | text |
| dataset.apply_chat_template | Apply the tokenizer’s chat template to rows containing a messages field | false |
| dataset.validation_path | Optional validation fileset reference | null |
| dataset.packing | Pack multiple samples into one sequence to reduce padding | false |

Training Method

| Parameter | Values | Description | Default |
| --- | --- | --- | --- |
| training.training_type | sft | Training method | sft |
| training.finetuning_type | lora, all_weights | Adapter regime. lora trains an adapter; all_weights performs full-weight training | lora |
| training.lora | { rank, alpha, dropout, target_modules, bias, use_rslora, random_state } | LoRA configuration (auto-filled with defaults when finetuning_type is lora) | (see below) |
| training.use_gradient_checkpointing | unsloth, true, false | Gradient checkpointing mode. unsloth uses Unsloth’s optimized implementation | unsloth |

LoRA parameters (training.lora):

| Parameter | Description | Default |
| --- | --- | --- |
| rank | LoRA rank | 16 |
| alpha | LoRA scaling factor | 16 |
| dropout | LoRA dropout probability | 0.0 |
| target_modules | Modules to adapt | Unsloth’s 7-module set: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| bias | Bias training mode (none, all, lora_only) | none |
| use_rslora | Use rank-stabilized LoRA | false |
| random_state | LoRA initialization seed | 3407 |
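In the standard LoRA formulation the adapter update is multiplied by alpha / rank, and rank-stabilized LoRA uses alpha / sqrt(rank) instead. Assuming both backends follow those conventions (this page does not say so explicitly), the sketch below compares the effective scale produced by the documented defaults; `lora_scale` is a hypothetical helper.

```python
import math


def lora_scale(rank, alpha, use_rslora=False):
    """Effective multiplier applied to the low-rank update.

    Standard LoRA: alpha / rank. Rank-stabilized LoRA: alpha / sqrt(rank).
    """
    if use_rslora:
        return alpha / math.sqrt(rank)
    return alpha / rank


print(lora_scale(16, 32))                   # Automodel defaults -> 2.0
print(lora_scale(16, 16))                   # Unsloth defaults -> 1.0
print(lora_scale(16, 16, use_rslora=True))  # Unsloth with rsLoRA -> 4.0
```

Under that assumption, the same rank produces a different update scale on each backend by default, which is worth keeping in mind when porting a configuration from one backend to the other.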

Schedule

| Parameter | Description | Default |
| --- | --- | --- |
| schedule.epochs | Number of passes over the training data | 1 |
| schedule.max_steps | Optional cap on training steps (overrides epochs when set) | null |
| schedule.warmup_steps | Linear warmup steps. Mutually exclusive with warmup_ratio | 0 |
| schedule.warmup_ratio | Warmup as a fraction of total steps | null |
| schedule.lr_scheduler_type | linear, cosine, constant, constant_with_warmup, cosine_with_restarts | linear |
| schedule.logging_steps | Logging cadence (steps) | 1 |
| schedule.save_steps | Checkpoint cadence (steps) | null |
| schedule.eval_steps | Evaluation cadence (steps) | null |
| schedule.seed | Random seed | 3407 |
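The relationship between warmup_steps and warmup_ratio can be sketched as follows. The helper is hypothetical: it treats the default warmup_steps of 0 as "unset" and rounds the ratio up to a whole step, both of which are assumptions rather than documented behavior.

```python
import math


def resolve_warmup(total_steps, warmup_steps=0, warmup_ratio=None):
    """Return the number of warmup steps implied by the schedule settings."""
    if warmup_ratio is not None and warmup_steps:
        # The schema describes the two fields as mutually exclusive.
        raise ValueError("set either warmup_steps or warmup_ratio, not both")
    if warmup_ratio is not None:
        # Assumption: a fractional result is rounded up to a whole step.
        return math.ceil(total_steps * warmup_ratio)
    return warmup_steps


print(resolve_warmup(200, warmup_ratio=0.25))  # 50
print(resolve_warmup(200, warmup_steps=10))    # 10
```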

Batch

| Parameter | Description | Default |
| --- | --- | --- |
| batch.per_device_train_batch_size | Per-step batch size on the GPU | 1 |
| batch.gradient_accumulation_steps | Steps to accumulate before a weight update | 1 |
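Unsloth has no global batch size field; on a single GPU the number of samples per weight update is the product of the two batch settings. A minimal sketch, with `effective_batch_size` as a hypothetical helper:

```python
def effective_batch_size(per_device_train_batch_size=1, gradient_accumulation_steps=1):
    """Samples contributing to each weight update on a single GPU."""
    return per_device_train_batch_size * gradient_accumulation_steps


# Two samples per forward/backward pass, accumulated over 4 passes,
# gives 8 samples per weight update.
print(effective_batch_size(2, 4))  # 8
```

Raising gradient_accumulation_steps instead of per_device_train_batch_size keeps peak activation memory roughly constant while increasing the effective batch size, at the cost of more passes per update.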

Optimizer

| Parameter | Description | Default |
| --- | --- | --- |
| optimizer.learning_rate | Step size for weight updates | 2e-4 |
| optimizer.weight_decay | L2 regularization strength | 0.0 |
| optimizer.optim | Optimizer (adamw_torch, adamw_torch_fused, adamw_8bit, paged_adamw_8bit, sgd). 8-bit optimizers reduce optimizer-state memory | adamw_8bit |

Hardware

| Parameter | Description | Default |
| --- | --- | --- |
| hardware.gpus | Comma-separated GPU indices (0 or 0,1) for CUDA_VISIBLE_DEVICES (selection, not reservation) | null |
| hardware.precision | Mixed-precision dtype (bf16, fp16). bf16 recommended for Ampere+ | bf16 |

Output (save method)

Unsloth’s output save_method controls the saved checkpoint shape:

| save_method | Result |
| --- | --- |
| lora | Saves the LoRA adapter (default) |
| merged_16bit | Merges the adapter into the base and saves a 16-bit checkpoint |
| merged_4bit | Merges the adapter into the base and saves a 4-bit checkpoint |

The merged_* methods are only valid when training.finetuning_type is lora.
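As a rough illustration, an Unsloth LoRA job that saves a merged 16-bit checkpoint might be assembled as below. The sketch uses only fields documented in this section: the model and fileset names are placeholders, nesting save_method under an output key is an assumption based on the section list above, and `check_save_method` is a hypothetical helper. Consult UnslothJobInput in the API reference for the authoritative shape.

```python
# Placeholder entity and fileset names; replace them with real references.
unsloth_job = {
    "model": {"name": "my-workspace/my-base-model", "max_seq_length": 2048,
              "load_in_4bit": True, "load_in_8bit": False},
    "dataset": {"path": "my-workspace/my-train-fileset", "text_field": "text"},
    "training": {"training_type": "sft", "finetuning_type": "lora",
                 "use_gradient_checkpointing": "unsloth"},
    "schedule": {"epochs": 1, "lr_scheduler_type": "linear"},
    "batch": {"per_device_train_batch_size": 1, "gradient_accumulation_steps": 8},
    "optimizer": {"learning_rate": 2e-4, "weight_decay": 0.0, "optim": "adamw_8bit"},
    "hardware": {"gpus": "0", "precision": "bf16"},
    # Assumption: save_method lives under the top-level "output" section.
    "output": {"save_method": "merged_16bit"},
}


def check_save_method(job):
    """Enforce the documented rule: merged_* requires finetuning_type lora."""
    save_method = job["output"]["save_method"]
    finetuning_type = job["training"]["finetuning_type"]
    if save_method.startswith("merged_") and finetuning_type != "lora":
        raise ValueError(f"{save_method} is only valid when finetuning_type is lora")
    return save_method


print(check_save_method(unsloth_job))  # merged_16bit
```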


GPU Memory Guidelines

Estimated GPU requirements by model size:

| Model Size | LoRA (1 GPU) | Full FT (min GPUs) |
| --- | --- | --- |
| 1B | 16 GB | 1 × 24 GB |
| 3B | 24 GB | 2 × 24 GB |
| 7-8B | 40 GB | 2-4 × 80 GB |
| 13B | 80 GB | 4 × 80 GB |
| 70B | 2 × 80 GB | 8+ × 80 GB |

Use LoRA for most fine-tuning tasks: it needs far less memory than full fine-tuning and often achieves comparable results. On a single memory-constrained GPU, the Unsloth backend with 4-bit loading lets you fit the largest models.
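For a rough sense of why quantized loading helps, the sketch below estimates the memory taken by the base model weights alone at different precisions. It is back-of-the-envelope arithmetic (parameters × bytes per parameter, with 1 GB taken as 10^9 bytes) and deliberately ignores activations, gradients, optimizer state, and quantization overhead, which is why the figures in the table above are considerably higher.

```python
# Approximate storage cost per parameter for each loading mode.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billions, precision="bf16"):
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("bf16", "8bit", "4bit"):
    print(precision, weight_memory_gb(7, precision), "GB")
```

For a 7B model this gives about 14 GB in bf16 versus 3.5 GB in 4-bit for the weights alone.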