Hyperparameter Options
Tip
Want to learn about hyperparameters at a high level? Check out the Customization concepts page.
For a complete list of hyperparameters and their valid values, constraints, and types:
Hyperparameters object
Properties
finetuning_type * string
The finetuning type for the customization job.
Allowed values:
lora, lora_merged, all_weights
training_type string
The training type for the customization job.
Allowed values:
dpo, sft, distillation
Default:
sft
warmup_steps integer
Learning rate schedulers gradually increase the learning rate from a small initial value to the target value
in `learning_rate` over this number of steps.
Default:
200seed integer
This is the seed that will be used to initialize all underlying PyTorch and Triton Trainers.
By default, this will be randomly initialized.
Caution: There are a number of processes that still introduce variance between training runs for models trained
from an HF checkpoint.
Default:
42
max_steps integer
If this parameter is provided and is greater than 0, execution stops after this number of steps.
This value should not be less than val_check_interval.
If it is less than val_check_interval, val_check_interval is set to max_steps - 1.
Default:
-1
optimizer string
The supported optimizers that are configurable for customization.
The Cosine Annealing LR scheduler starts at min_learning_rate and moves toward learning_rate over warmup_steps.
Note: For models listed as NeMo checkpoint type, the only Adam implementation is Fused AdamW.
Allowed values:
adam_with_cosine_annealing, adam_with_flat_lr, adamw_with_cosine_annealing, adamw_with_flat_lr
Default:
adamw_with_cosine_annealing
adam_beta1 number
Controls the exponential decay rate for the moving average of past gradients (momentum); used only with cosine_annealing learning rate schedulers.
Default:
0.9
adam_beta2 number
Controls the decay rate for the moving average of past squared gradients (adaptive learning rate scaling); used only with cosine_annealing learning rate schedulers.
Default:
0.99
min_learning_rate number
Starting point for learning_rate scheduling, only used with cosine_annealing learning rate schedulers. Must be lower than learning_rate if provided.
If not provided, or 0, this will default to 0.1 * learning_rate.
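To make the interaction between `warmup_steps`, `learning_rate`, and `min_learning_rate` concrete, here is a minimal sketch of a warmup-then-cosine-annealing schedule. The linear warmup from `min_learning_rate` to `learning_rate` follows the descriptions above; the shape of the decay after warmup (back toward `min_learning_rate`) is an assumption for illustration, not the service's exact implementation.

```python
import math

def lr_at_step(step, total_steps, learning_rate=1e-4,
               min_learning_rate=0.0, warmup_steps=200):
    """Illustrative warmup + cosine-annealing schedule (not the exact service code)."""
    if min_learning_rate == 0.0:
        # Documented fallback when min_learning_rate is omitted or 0.
        min_learning_rate = 0.1 * learning_rate
    if step < warmup_steps:
        # Linear warmup from min_learning_rate up to learning_rate.
        return min_learning_rate + (learning_rate - min_learning_rate) * step / warmup_steps
    # Assumed cosine decay from learning_rate back toward min_learning_rate after warmup.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_learning_rate + 0.5 * (learning_rate - min_learning_rate) * (1 + math.cos(math.pi * progress))
```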
batch_size integer
Batch size is the number of training samples used in a single forward and backward pass.
This is related to gradient_accumulation_steps in the HF documentation, where gradient_accumulation_steps = batch_size // micro_batch_size.
The default batch size for DPO, when not provided, is 16.
Default:
8
epochs integer
Epochs is the number of complete passes through the training dataset.
The default for DPO, when not provided, is 1.
Default:
50
learning_rate number
How much to adjust the model parameters in response to the loss gradient.
The default for DPO, when not provided, is 9e-06.
Default:
0.0001
log_every_n_steps integer
Controls the logging frequency for metrics tracking.
Logging on every single batch may slow down training.
By default, logs every 10 training steps. This parameter corresponds to log_frequency in HF.
Default:
10
val_check_interval number
Controls how often to check the validation set, and how often to check for the best checkpoint.
You can check after a fixed number of training batches by passing an integer value.
You can pass a float in the range [0.1, 1.0] to check after a fraction of the training epoch.
If the best checkpoint is found after validation, it is saved temporarily at that time; it is currently
only uploaded at the end of the training run.
Note: Early Stopping monitors the validation loss and stops the training when no improvement is observed
after 10 epochs with a minimum delta of 0.001.
If val_check_interval is greater than the number of training batches, validation will run every epoch.
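To illustrate the integer-versus-float behavior of `val_check_interval`, the sketch below shows one way the effective interval could be resolved. The helper name and clamping details are illustrative; only the documented rules (an integer means every N batches, a float in [0.1, 1.0] means a fraction of an epoch, and values larger than an epoch fall back to once per epoch) are taken from this page.

```python
def resolve_val_check_interval(val_check_interval, batches_per_epoch):
    """Illustrative resolution of val_check_interval (not the actual service logic)."""
    if isinstance(val_check_interval, float):
        # A float in [0.1, 1.0] means "a fraction of the training epoch".
        return max(1, int(val_check_interval * batches_per_epoch))
    # An integer means "every N training batches"; if it exceeds the number of
    # batches in an epoch, validation effectively runs once per epoch.
    return min(val_check_interval, batches_per_epoch)
```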
weight_decay number
An additional penalty term added during gradient descent to keep weights small and mitigate overfitting.
sft object
SFT-specific parameters
Properties
hidden_dropout number
Dropout probability for the transformer hidden states.
attention_dropout number
Dropout probability for attention.
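For context, a hyperparameters block for an SFT job using these fields might look like the following sketch. The field names match this reference, but the chosen values and the surrounding request structure are illustrative, not recommendations.

```python
# Illustrative SFT hyperparameters block; values are placeholders.
sft_hyperparameters = {
    "finetuning_type": "lora",
    "training_type": "sft",
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 1e-4,
    "sft": {
        "hidden_dropout": 0.1,     # dropout on transformer hidden states
        "attention_dropout": 0.1,  # dropout on attention
    },
}
```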
dpo object
DPO-specific parameters
Properties
ref_policy_kl_penalty number
Controls how strongly the trained policy is penalized for deviating from the reference policy. Increasing this value encourages the policy to stay closer to the reference (more conservative learning), while decreasing it allows more freedom to explore user-preferred behavior. This parameter is called `beta` in the original DPO paper.
Default:
0.05
preference_loss_weight number
Scales the contribution of the preference loss to the overall training objective. Increasing this value emphasizes learning from preference comparisons more strongly.
Default:
1
preference_average_log_probs boolean
If set to true, the preference loss uses average log-probabilities, making the loss less sensitive to sequence length. Setting it to false (default) uses total log-probabilities, giving more influence to longer sequences.
Default:
False
sft_loss_weight number
Scales the contribution of the supervised fine-tuning loss. Setting this to 0 disables SFT entirely, allowing training to focus exclusively on preference-based optimization.
Default:
0
sft_average_log_probs boolean
If set to true, the supervised fine-tuning (SFT) loss normalizes by sequence length, treating all examples equally regardless of length. If false (default), longer examples contribute more to the loss.
Default:
False
lora object
LoRA-specific parameters
Properties
adapter_dim integer
Size of the adapter layers added throughout the model. This is the size of the tunable layers that LoRA adds to various transformer blocks in the base model. This parameter must be a power of 2.
Default:
8
alpha integer
Scaling factor for the LoRA update. Controls the magnitude of the low-rank approximation. A higher alpha value increases the impact of the LoRA weights, effectively amplifying the changes made to the original model. Proper tuning of alpha is essential, as it balances the adaptation's impact, ensuring neither underfitting nor overfitting. This is often a multiple of the adapter dimension.
Default:
16
adapter_dropout number
Dropout probability in the adapter layer.
target_modules array
Target specific layers in the model architecture to apply LoRA. We select a subset of the layers by default.
However, specific layers can also be selected. For example:
- `linear_qkv`: Apply LoRA to the fused linear layer used for query, key, and value projections in self-attention.
- `linear_proj`: Apply LoRA to the linear layer used for projecting the output of self-attention.
- `linear_fc1`: Apply LoRA to the first fully-connected layer in MLP.
- `linear_fc2`: Apply LoRA to the second fully-connected layer in MLP.
- `*_proj`: Apply LoRA to all layers used for projecting the output of self-attention.
Target modules can also contain wildcards. For example, you can specify `target_modules=['*.layers.0.*.linear_qkv', '*.layers.1.*.linear_qkv']` to add LoRA to only linear_qkv on the first two layers.
Our framework only supports a fused LoRA implementation; canonical LoRA is not supported.
Array items:
item string
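As an example of combining the LoRA fields above, the sketch below restricts LoRA to `linear_qkv` on the first two layers using the wildcard syntax described for `target_modules`. The values are illustrative only.

```python
# Illustrative LoRA hyperparameters block; values are placeholders.
lora_hyperparameters = {
    "finetuning_type": "lora",
    "training_type": "sft",
    "lora": {
        "adapter_dim": 16,    # power of 2
        "alpha": 32,          # often a multiple of adapter_dim
        "adapter_dropout": 0.05,
        # Wildcards limit LoRA to linear_qkv on the first two layers.
        "target_modules": ["*.layers.0.*.linear_qkv", "*.layers.1.*.linear_qkv"],
    },
}
```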
distillation object
Knowledge Distillation-specific parameters
Properties
teacher * string
Target to be used as the teacher for distillation.
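A distillation hyperparameters block might look like the following sketch. The `teacher` value is a placeholder, and the pairing with the other fields shown here is only an assumption for illustration.

```python
# Illustrative distillation hyperparameters block; the teacher target is a placeholder.
distillation_hyperparameters = {
    "training_type": "distillation",
    "finetuning_type": "all_weights",  # assumed pairing for illustration
    "distillation": {
        "teacher": "my-teacher-model@v1",  # placeholder; use a real teacher target
    },
}
```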
sequence_packing_enabled boolean
Sequence packing can improve training speed by letting the training step work on multiple rows at the same time. Experimental and not supported by all models. If a model is not supported, a warning will be returned in the response body and training will proceed with sequence packing disabled. Not recommended for production use. This flag may be removed in the future. See https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/packed_sequence.html for more details.
Default:
False
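Several fields above note DPO-specific defaults; the sketch below pulls them together into one illustrative DPO hyperparameters block. The values mirror the documented defaults but are not recommendations, and the surrounding request structure is assumed.

```python
# Illustrative DPO hyperparameters block; values mirror the documented DPO defaults.
dpo_hyperparameters = {
    "training_type": "dpo",
    "finetuning_type": "lora",
    "epochs": 1,              # DPO default when not provided
    "batch_size": 16,         # DPO default when not provided
    "learning_rate": 9e-06,   # DPO default when not provided
    "dpo": {
        "ref_policy_kl_penalty": 0.05,      # `beta` in the DPO paper
        "preference_loss_weight": 1,
        "sft_loss_weight": 0,               # 0 disables the auxiliary SFT loss
        "preference_average_log_probs": False,
        "sft_average_log_probs": False,
    },
    "sequence_packing_enabled": False,
}
```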