nemo_rl.algorithms.reward_functions#

Module Contents#

Classes#

RewardShapingConfig

Configuration for reward function processing.

Functions#

apply_reward_shaping

Process rewards by applying penalties to responses that exceed max_response_length. Currently, this function only supports DAPO-style reward shaping as described in the DAPO paper: https://arxiv.org/pdf/2503.14476.

Data#

API#

nemo_rl.algorithms.reward_functions.Tensor#

'TypeVar(…)'

class nemo_rl.algorithms.reward_functions.RewardShapingConfig#

Bases: typing.TypedDict

Configuration for reward function processing.

This configuration enables custom reward shaping, currently supporting DAPO-style penalties for responses that exceed the maximum response length threshold.


enabled: bool#

None

overlong_buffer_length: NotRequired[int]#

None

overlong_buffer_penalty: NotRequired[float]#

None

max_response_length: NotRequired[int]#

None
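Because RewardShapingConfig is a TypedDict, it can be constructed as a plain dictionary. The sketch below is a minimal example; the field values are illustrative assumptions, not defaults from the library.

```python
from nemo_rl.algorithms.reward_functions import RewardShapingConfig

# Illustrative values only; tune them for your setup. The buffer length and
# penalty follow the DAPO-style soft overlong punishment described above.
reward_shaping_cfg: RewardShapingConfig = {
    "enabled": True,                 # turn reward shaping on
    "max_response_length": 20480,    # hard cap on response length (tokens)
    "overlong_buffer_length": 4096,  # soft-penalty window below the cap
    "overlong_buffer_penalty": 1.0,  # penalty scale inside the window
}
```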

nemo_rl.algorithms.reward_functions.apply_reward_shaping(
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict,
cfg: nemo_rl.algorithms.reward_functions.RewardShapingConfig,
) → nemo_rl.distributed.batched_data_dict.BatchedDataDict#

Process rewards by applying penalties to responses that exceed max_response_length. Currently, this function only supports DAPO-style reward shaping as described in the DAPO paper: https://arxiv.org/pdf/2503.14476.

It can, however, be extended to support other custom reward-shaping logic.
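For intuition, the following is a minimal, self-contained sketch of the per-response penalty defined in the DAPO paper (soft overlong punishment), expressed in terms of the RewardShapingConfig fields. It is not the library's implementation; in particular, treating overlong_buffer_penalty as a linear scale on the penalty is an assumption based on the field names.

```python
def dapo_overlong_penalty(
    response_length: int,
    max_response_length: int,
    overlong_buffer_length: int,
    overlong_buffer_penalty: float,
) -> float:
    """Sketch of the DAPO soft overlong punishment for a single response."""
    soft_limit = max_response_length - overlong_buffer_length
    if response_length <= soft_limit:
        # Within the soft limit: no penalty.
        return 0.0
    if response_length <= max_response_length:
        # Inside the buffer: penalty ramps linearly from 0 down to
        # -overlong_buffer_penalty as the response approaches the hard cap.
        return (soft_limit - response_length) / overlong_buffer_length * overlong_buffer_penalty
    # Past the hard cap: full penalty.
    return -overlong_buffer_penalty
```

In the paper, this penalty is added to the original task reward for the response.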