nemo_rl.algorithms.reward_functions#
Module Contents#
Classes#
| RewardShapingConfig | Configuration for reward function processing. |
Functions#
| apply_reward_shaping | Process rewards by applying penalties to responses that exceed max_response_length. Currently supports only DAPO reward shaping, as described in the DAPO paper: https://arxiv.org/pdf/2503.14476. |
Data#
API#
- nemo_rl.algorithms.reward_functions.Tensor#
'TypeVar(…)'
- class nemo_rl.algorithms.reward_functions.RewardShapingConfig#
Bases:
typing.TypedDict
Configuration for reward function processing.
This configuration enables custom reward shaping, currently supporting DAPO-style penalties for responses that exceed the maximum response length threshold.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- enabled: bool#
None
- overlong_buffer_length: NotRequired[int]#
None
- overlong_buffer_penalty: NotRequired[float]#
None
- max_response_length: NotRequired[int]#
None
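A minimal sketch of how this TypedDict might be populated. The values are illustrative, not defaults, and the assumption that the three NotRequired keys must be supplied whenever enabled is True is ours, not stated by the API:

```python
from nemo_rl.algorithms.reward_functions import RewardShapingConfig

# Illustrative values only (assumption: the optional keys are needed
# to compute the overlong penalty when enabled is True).
reward_shaping_cfg: RewardShapingConfig = {
    "enabled": True,
    "max_response_length": 2048,     # hard cap on generated tokens
    "overlong_buffer_length": 512,   # final 512 tokens before the cap form the penalty ramp
    "overlong_buffer_penalty": 1.0,  # maximum penalty magnitude at/after the cap
}
```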
- nemo_rl.algorithms.reward_functions.apply_reward_shaping(
- batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict,
- cfg: nemo_rl.algorithms.reward_functions.RewardShapingConfig,
- )#
Process rewards by applying penalties to responses that exceed max_response_length. Currently, this function only supports DAPO reward shaping, as described in the DAPO paper: https://arxiv.org/pdf/2503.14476.
It can, however, be extended to support arbitrary custom reward-shaping logic.
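Since the docstring points at the DAPO paper's soft overlong punishment, here is a minimal standalone sketch of that penalty expressed in terms of the config fields above. The helper name dapo_overlong_penalty is hypothetical, and the exact way apply_reward_shaping combines these fields is an assumption drawn from the paper, not from this module's source:

```python
import torch

def dapo_overlong_penalty(
    response_lengths: torch.Tensor,  # (batch,) generated-token counts
    max_response_length: int,
    overlong_buffer_length: int,
    overlong_buffer_penalty: float,
) -> torch.Tensor:
    """Non-positive per-response penalty in the style of DAPO's soft overlong punishment.

    Responses shorter than (max_response_length - overlong_buffer_length) get no
    penalty; within the final buffer the penalty ramps linearly down to
    -overlong_buffer_penalty at max_response_length.
    """
    expected_len = max_response_length - overlong_buffer_length
    exceed_len = (response_lengths - expected_len).clamp(min=0)
    return -exceed_len / overlong_buffer_length * overlong_buffer_penalty

# e.g. max_response_length=2048, overlong_buffer_length=512, penalty factor 1.0
lengths = torch.tensor([1000, 1800, 2048])
dapo_overlong_penalty(lengths, 2048, 512, 1.0)
# -> tensor([ 0.0000, -0.5156, -1.0000])
```

In practice this penalty would be added to the task reward already stored in the batch, which is the kind of custom reward logic the function's docstring says it could be extended to cover.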