Hyperparameters
- class nemo_microservices.types.customization.Hyperparameters(*args: Any, **kwargs: Any)
Bases: BaseModel
- finetuning_type: Literal['lora', 'all_weights']
The finetuning type for the customization job.
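A minimal sketch of constructing this model directly, assuming standard pydantic keyword construction; finetuning_type is the only required field, and the surrounding job-submission API is not shown here.

```python
from nemo_microservices.types.customization import Hyperparameters

# Minimal construction: finetuning_type is the only required field; every
# other field is optional and falls back to server-side defaults when omitted.
hp = Hyperparameters(finetuning_type="lora")
print(hp.finetuning_type)  # -> "lora"
```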
- adam_beta1: float | None = None
Controls the exponential decay rate for the moving average of past gradients (momentum). Only used with cosine_annealing learning rate schedulers.
- adam_beta2: float | None = None
Controls the decay rate for the moving average of past squared gradients (adaptive learning rate scaling). Only used with cosine_annealing learning rate schedulers.
- batch_size: int | None = None
The number of training samples used in a single forward and backward pass. This is related to gradient_accumulation_steps in the HF documentation, where gradient_accumulation_steps = batch_size // micro_batch_size. The default batch size for DPO when not provided is 16.
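To make the quoted relationship concrete, a small arithmetic sketch; micro_batch_size is not a field of this model and is shown purely for illustration.

```python
# Illustration of the documented relationship:
# gradient_accumulation_steps = batch_size // micro_batch_size
batch_size = 16            # value passed as Hyperparameters.batch_size
micro_batch_size = 4       # per-device batch size (not part of this model)
gradient_accumulation_steps = batch_size // micro_batch_size
assert gradient_accumulation_steps == 4
```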
- distillation: DistillationParameters | None = None
Specific parameters for knowledge distillation.
- dpo: Dpo | None = None
Specific parameters for DPO.
- epochs: int | None = None
The number of complete passes through the training dataset. The default for DPO when not provided is 1.
- learning_rate: float | None = None
How much to adjust the model parameters in response to the loss gradient. The default for DPO when not provided is 9e-06.
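Pulling the DPO defaults mentioned above into one place, a sketch of a DPO configuration with those values written out explicitly; omitting the fields yields the same values, and the finetuning_type shown is only a placeholder choice.

```python
from nemo_microservices.types.customization import Hyperparameters

# DPO configuration with the documented defaults spelled out explicitly.
dpo_hp = Hyperparameters(
    finetuning_type="all_weights",  # placeholder choice, not a DPO requirement
    training_type="dpo",
    batch_size=16,        # DPO default when not provided
    epochs=1,             # DPO default when not provided
    learning_rate=9e-06,  # DPO default when not provided
)
```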
- log_every_n_steps: int | None = None
Controls the logging frequency for metrics tracking. Logging on every single batch may slow down training.
By default, logs every 10 training steps. This parameter corresponds to log_frequency in HF.
- lora: LoraParameters | None = None
Specific parameters for LoRA.
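A sketch of attaching LoRA-specific parameters. The field names shown below (adapter_dim, alpha) are assumptions for illustration only; see the LoraParameters reference for the actual schema.

```python
from nemo_microservices.types.customization import Hyperparameters

# LoRA fine-tuning with adapter-specific parameters; the dict is validated
# into LoraParameters by pydantic. Field names here are assumed, not confirmed.
lora_hp = Hyperparameters(
    finetuning_type="lora",
    training_type="sft",
    lora={"adapter_dim": 16, "alpha": 32},
)
```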
- max_steps: int | None = None
If this parameter is provided and is greater than 0, execution stops after this number of steps.
This number cannot be less than val_check_interval; if it is, val_check_interval is set to max_steps - 1.
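A small sketch of the adjustment described above, expressed as plain Python for clarity; this logic runs server-side, and the snippet only mirrors the documented behaviour.

```python
def effective_val_check_interval(max_steps: int, val_check_interval: int) -> int:
    # Mirrors the documented rule: if max_steps is less than
    # val_check_interval, val_check_interval is lowered to max_steps - 1.
    if max_steps > 0 and max_steps < val_check_interval:
        return max_steps - 1
    return val_check_interval

assert effective_val_check_interval(max_steps=50, val_check_interval=100) == 49
assert effective_val_check_interval(max_steps=500, val_check_interval=100) == 100
```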
- min_learning_rate: float | None = None
The starting point for learning_rate scheduling; only used with cosine_annealing learning rate schedulers. Must be lower than learning_rate if provided. If not provided, this defaults to 0.1 ** learning_rate.
- optimizer: Literal['adam_with_cosine_annealing', 'adam_with_flat_lr', 'adamw_with_cosine_annealing', 'adamw_with_flat_lr'] | None = None
The supported optimizers that are configurable for customization.
Cosine Annealing LR scheduler will start at min_learning_rate and move towards learning_rate over warmup_steps.
Note: For models listed as NeMo checkpoint type, the only Adam implementation is Fused AdamW.
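A sketch of a cosine-annealing configuration that ties together the optimizer, Adam betas, warmup, and learning-rate bounds described above; the specific values are illustrative only.

```python
from nemo_microservices.types.customization import Hyperparameters

# Cosine annealing: the LR ramps from min_learning_rate towards learning_rate
# over warmup_steps; adam_beta1/adam_beta2 control the moment estimates.
cosine_hp = Hyperparameters(
    finetuning_type="lora",
    training_type="sft",
    optimizer="adam_with_cosine_annealing",
    learning_rate=1e-4,
    min_learning_rate=1e-5,   # must be lower than learning_rate
    warmup_steps=50,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.01,
)
```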
- seed: int | None = None
The seed used to initialize all underlying PyTorch and Triton trainers. By default, this is randomly initialized.
Caution: a number of processes still introduce variance between training runs for models trained from an HF checkpoint.
- sequence_packing_enabled: bool | None = None
Sequence packing can improve training speed by letting the trainer work on multiple rows at the same time. Experimental and not supported by all models. If a model is not supported, a warning will be returned in the response body and training will proceed with sequence packing disabled. Not recommended for production use. This flag may be removed in the future. See https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/packed_sequence.html for more details.
- sft: SftParameters | None = None
Specific parameters for SFT.
- training_type: Literal['dpo', 'sft', 'distillation'] | None = None
The training type for the customization job.
- val_check_interval: float | None = None
Controls how often to check the validation set, and how often to check for the best checkpoint.
Pass an integer value to check after a fixed number of training batches, or a float in the range [0.1, 1.0] to check after a fraction of the training epoch.
If the best checkpoint is found after validation, it is saved temporarily at that time; it is currently only uploaded at the end of the training run.
- Note: Early stopping monitors the validation loss and stops training when no improvement is observed after 10 epochs, with a minimum delta of 0.001.
If val_check_interval is greater than the number of training batches, validation runs every epoch.
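Two illustrative ways of setting the interval, per the description above: an integer counts training batches, while a float is a fraction of an epoch.

```python
from nemo_microservices.types.customization import Hyperparameters

# Check validation every 200 training batches (integer interpretation).
by_batches = Hyperparameters(finetuning_type="lora", val_check_interval=200)

# Check validation four times per epoch (float fraction in [0.1, 1.0]).
by_fraction = Hyperparameters(finetuning_type="lora", val_check_interval=0.25)
```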
- warmup_steps: int | None = None
Learning rate schedulers gradually increase the learning rate from a small initial value to the target learning_rate over this number of steps.
- weight_decay: float | None = None
An additional penalty term applied during gradient descent to keep weights small and mitigate overfitting.
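Finally, a hedged end-to-end sketch of passing this model to a customization job. The client entry point and the create() arguments shown here (config, dataset) are assumptions for illustration; consult the customization jobs reference for the actual signature.

```python
from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.customization import Hyperparameters

client = NeMoMicroservices(base_url="http://nemo.example.com")  # placeholder URL

hp = Hyperparameters(
    finetuning_type="lora",
    training_type="sft",
    epochs=3,
    batch_size=8,
    learning_rate=1e-4,
)

# Hypothetical job submission call -- argument names are assumptions here.
job = client.customization.jobs.create(
    config="meta/llama-3.1-8b-instruct",
    dataset={"name": "my-dataset"},
    hyperparameters=hp,
)
print(job.id)
```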