nemo_microservices.types.customization.dpo_parameters#
Module Contents#
Classes#
API#
- class nemo_microservices.types.customization.dpo_parameters.DpoParameters(/, **data: typing.Any)#
Bases: nemo_microservices._models.BaseModel
- max_grad_norm: Optional[float]#
None
Maximum gradient norm for gradient clipping during training.
Prevents exploding gradients by scaling down gradients that exceed this threshold. Lower this value (e.g., 0.5) if you observe training instability, NaN losses, or erratic loss spikes. Increase it (e.g., 5.0) if training seems overly conservative or progress is too slow. Typical values range from 0.5 to 5.0.
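The clipping behavior described above can be sketched in plain Python (a minimal illustration with a hypothetical helper, not this library's implementation; training frameworks such as PyTorch provide it via torch.nn.utils.clip_grad_norm_):

```python
import math

def clip_grad_norm(grads, max_grad_norm):
    """Scale gradients down so their global L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_grad_norm:
        scale = max_grad_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A gradient vector with norm 5.0 is rescaled to norm 1.0; a vector already
# within the threshold is left untouched.
clipped = clip_grad_norm([3.0, 4.0], max_grad_norm=1.0)
```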
- preference_average_log_probs: Optional[bool]#
None
If set to true, the preference loss uses average log-probabilities, making the loss less sensitive to sequence length. Setting it to false (default) uses total log-probabilities, giving more influence to longer sequences.
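The difference between average and total log-probabilities can be shown with a small sketch (hypothetical helper and made-up numbers, not part of this library):

```python
def sequence_logprob(token_logprobs, average=False):
    """Sequence log-probability from per-token log-probs: total or length-averaged."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if average else total

short = [-0.5] * 2   # 2 tokens
long = [-0.5] * 10   # 10 tokens with the same per-token quality

# Totals: the longer sequence scores much lower (-5.0 vs -1.0), so it
# dominates the loss. Averages: both score -0.5 regardless of length.
totals = (sequence_logprob(short), sequence_logprob(long))
averages = (sequence_logprob(short, average=True), sequence_logprob(long, average=True))
```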
- preference_loss_weight: Optional[float]#
None
Scales the contribution of the preference loss to the overall training objective.
Increasing this value emphasizes learning from preference comparisons more strongly.
- ref_policy_kl_penalty: Optional[float]#
None
Controls how strongly the trained policy is penalized for deviating from the reference policy. Increasing this value encourages the policy to stay closer to the reference (more conservative learning), while decreasing it allows more freedom to explore user-preferred behavior. This parameter is called beta in the original DPO paper.
- sft_average_log_probs: Optional[bool]#
None
If set to true, the supervised fine-tuning (SFT) loss normalizes by sequence length, treating all examples equally regardless of length. If false (default), longer examples contribute more to the loss.
- sft_loss_weight: Optional[float]#
None
Scales the contribution of the supervised fine-tuning loss.
Setting this to 0 disables SFT entirely, allowing training to focus exclusively on preference-based optimization.
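Taken together, the loss weights and KL penalty combine into a single training objective. The following is a minimal numeric sketch under stated assumptions (a hypothetical function with made-up log-prob values, not the service's implementation), using the standard DPO preference loss:

```python
import math

def dpo_objective(
    logp_chosen, logp_rejected,          # policy log-probs of chosen/rejected responses
    ref_logp_chosen, ref_logp_rejected,  # reference-policy log-probs of the same responses
    sft_loss=0.0,                        # SFT loss on the chosen response
    ref_policy_kl_penalty=0.05,          # "beta" in the DPO paper
    preference_loss_weight=1.0,
    sft_loss_weight=0.0,
):
    # Margin: how much more the policy favors the chosen response over the
    # rejected one, relative to the reference policy.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Standard DPO preference loss: -log sigmoid(beta * margin).
    preference_loss = -math.log(1.0 / (1.0 + math.exp(-ref_policy_kl_penalty * margin)))
    # sft_loss_weight=0 disables the SFT term, leaving pure preference optimization.
    return preference_loss_weight * preference_loss + sft_loss_weight * sft_loss

# Zero margin (policy agrees with the reference) gives -log(0.5) = log(2) ~ 0.693:
loss = dpo_objective(-10.0, -12.0, -10.0, -12.0)
```

Raising sft_loss_weight above zero anchors the policy to the chosen responses during preference training; setting it to 0 (as above) leaves the objective purely preference-based.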