nemo_rl.algorithms.reward_functions#
Module Contents#
Classes#
| RewardShapingConfig | Configuration for reward function processing. |
Functions#
| apply_reward_shaping | Process rewards by applying penalties to responses that exceed max_response_length. Currently supports only DAPO reward shaping, as described in the DAPO paper: https://arxiv.org/pdf/2503.14476. |
Data#
API#
- nemo_rl.algorithms.reward_functions.Tensor#
'TypeVar(…)'
- class nemo_rl.algorithms.reward_functions.RewardShapingConfig#
Bases:
typing.TypedDict
Configuration for reward function processing.
This configuration enables custom reward shaping, currently supporting DAPO-style penalties for responses that exceed the maximum response length threshold.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- enabled: bool#
None
- overlong_buffer_length: NotRequired[int]#
None
- overlong_buffer_penalty: NotRequired[float]#
None
- max_response_length: NotRequired[int]#
None
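A minimal sketch of how this TypedDict might be populated. The values are illustrative, not defaults, and the assumption that the three NotRequired keys must be supplied whenever enabled is True is ours, not stated by the API:

```python
from nemo_rl.algorithms.reward_functions import RewardShapingConfig

# Illustrative values only (assumption: the optional keys are needed
# to compute the overlong penalty when enabled is True).
reward_shaping_cfg: RewardShapingConfig = {
    "enabled": True,
    "max_response_length": 2048,     # hard cap on generated tokens
    "overlong_buffer_length": 512,   # final 512 tokens before the cap form the penalty ramp
    "overlong_buffer_penalty": 1.0,  # maximum penalty magnitude at/after the cap
}
```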
- nemo_rl.algorithms.reward_functions.apply_reward_shaping(
- batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict,
- cfg: nemo_rl.algorithms.reward_functions.RewardShapingConfig,
- )#
Process rewards by applying penalties to responses that exceed max_response_length. Currently, this function only supports DAPO reward shaping, as described in the DAPO paper: https://arxiv.org/pdf/2503.14476.
It can, however, be extended to support arbitrary custom reward-shaping logic.
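Since the docstring points at the DAPO paper's soft overlong punishment, here is a minimal standalone sketch of that penalty expressed in terms of the config fields above. The helper name dapo_overlong_penalty is hypothetical, and the exact way apply_reward_shaping combines these fields is an assumption drawn from the paper, not from this module's source:

```python
import torch

def dapo_overlong_penalty(
    response_lengths: torch.Tensor,  # (batch,) generated-token counts
    max_response_length: int,
    overlong_buffer_length: int,
    overlong_buffer_penalty: float,
) -> torch.Tensor:
    """Non-positive per-response penalty in the style of DAPO's soft overlong punishment.

    Responses shorter than (max_response_length - overlong_buffer_length) get no
    penalty; within the final buffer the penalty ramps linearly down to
    -overlong_buffer_penalty at max_response_length.
    """
    expected_len = max_response_length - overlong_buffer_length
    exceed_len = (response_lengths - expected_len).clamp(min=0)
    return -exceed_len / overlong_buffer_length * overlong_buffer_penalty

# e.g. max_response_length=2048, overlong_buffer_length=512, penalty factor 1.0
lengths = torch.tensor([1000, 1800, 2048])
dapo_overlong_penalty(lengths, 2048, 512, 1.0)
# -> tensor([ 0.0000, -0.5156, -1.0000])
```

In practice this penalty would be added to the task reward already stored in the batch, which is the kind of custom reward logic the function's docstring says it could be extended to cover.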