`nemo_rl.environments.rewards`

Module Contents

Functions

`math_expression_reward`	Reward the agent when the answer within the <{tag}> tags is the same expression as the ground truth.
`format_reward`	Reward the agent when the response follows the format: `(.*) <think> (.*) </think> <answer> (.*) </answer>`.
`exact_answer_alphanumeric_reward`	Reward the agent when the answer within the <{answer_tag}> tags is the same as the ground truth (case-insensitive).
`bbox_giou_reward`	Given [x1, y1, x2, y2] normalized bounding box coordinates within the <{answer_tag}> tags, compute the GIoU between the ground truth and the response.
`combine_reward_functions`	Return a callable that takes (ground_truth, response), evaluates multiple reward functions in sequence, and combines their weighted rewards.

Data

`math_verify_func`
`boxed`

API

nemo_rl.environments.rewards.math_verify_func: ‘math_metric(…)’

nemo_rl.environments.rewards.boxed: None

nemo_rl.environments.rewards.math_expression_reward(ground_truth: str, response: str, tag: str = 'answer') → tuple[float, bool]

Reward the agent when the answer within the <{tag}> tags is the same expression as the ground truth.

The tag is customizable and must be specified as part of the user chain-of-thought (CoT) prompt text file.
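A minimal sketch of the extract-then-compare idea, assuming the reward is 1.0 with `is_correct=True` on a match and 0.0 with `False` otherwise (the documented return type is `tuple[float, bool]`, but these exact values are an assumption). The module-level `math_verify_func` suggests the real comparison is delegated to a math-verification metric; because that object is not documented here, this sketch swaps in a tiny, hypothetical stdlib evaluator that only understands numbers with `+`, `-`, `*`, `/`, so `0.5` and `1/2` count as the same expression. `math_expression_reward_sketch` and `_eval_arith` are illustrative names, not part of `nemo_rl`.

```python
import ast
import operator
import re
from fractions import Fraction

# Binary operators accepted by the toy evaluator (a stand-in for a real math verifier).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval_arith(expr: str) -> Fraction:
    """Exactly evaluate an expression built from numbers, unary minus, +, -, *, /."""

    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return Fraction(str(node.value))  # "0.5" -> Fraction(1, 2)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def math_expression_reward_sketch(
    ground_truth: str, response: str, tag: str = "answer"
) -> tuple[float, bool]:
    # Take the first <tag>...</tag> span; a missing tag earns no reward.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, response, flags=re.DOTALL)
    if match is None:
        return 0.0, False
    try:
        is_correct = _eval_arith(match.group(1).strip()) == _eval_arith(ground_truth.strip())
    except (ValueError, SyntaxError, ZeroDivisionError):
        # Anything the toy evaluator cannot handle is treated as a mismatch.
        is_correct = False
    return (1.0 if is_correct else 0.0), is_correct
```

A production verifier would also need to handle LaTeX and symbolic answers, which this toy evaluator rejects as mismatches.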

nemo_rl.environments.rewards.format_reward(ground_truth: str, response: str, think_tag: str = 'think', answer_tag: str = 'answer') → tuple[float, Optional[bool]]

Reward the agent when the response follows the format: `(.*) <think> (.*) </think> <answer> (.*) </answer>`.

The think_tag and answer_tag are customizable and must be specified as part of the user chain-of-thought (CoT) prompt text file.
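A format check of this kind can be sketched with a single regular expression. The function below is a hypothetical `format_reward_sketch`, assuming an all-or-nothing score (1.0 for a full match, 0.0 otherwise; the real function may weight the tags differently) and assuming the `Optional[bool]` slot is `None` because a format check says nothing about whether the answer is correct.

```python
import re
from typing import Optional


def format_reward_sketch(
    ground_truth: str,
    response: str,
    think_tag: str = "think",
    answer_tag: str = "answer",
) -> tuple[float, Optional[bool]]:
    think, answer = re.escape(think_tag), re.escape(answer_tag)
    # Anything may precede <think>; only whitespace may separate </think> from
    # <answer>; nothing may follow </answer> once the response is stripped.
    pattern = rf"(.*)<{think}>(.*)</{think}>\s*<{answer}>(.*)</{answer}>"
    matched = re.fullmatch(pattern, response.strip(), flags=re.DOTALL) is not None
    # ground_truth is unused: only the shape of the response is checked,
    # so no correctness verdict is returned (None).
    return (1.0 if matched else 0.0), None
```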

nemo_rl.environments.rewards.exact_answer_alphanumeric_reward(ground_truth: str, response: str, answer_tag: str = 'answer') → tuple[float, bool]

Reward the agent when the answer within the <{answer_tag}> tags is the same as the ground truth (case-insensitive).

The answer_tag is customizable and must be specified as part of the user chain-of-thought (CoT) prompt text file.
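A sketch of a case-insensitive exact-match check. It assumes, based only on the function name, that just letters and digits are compared (so punctuation and whitespace are ignored), and it assumes a 1.0/0.0 reward. Both the normalization rule and the name `exact_answer_alphanumeric_reward_sketch` are hypothetical.

```python
import re


def exact_answer_alphanumeric_reward_sketch(
    ground_truth: str, response: str, answer_tag: str = "answer"
) -> tuple[float, bool]:
    tag = re.escape(answer_tag)
    # Take the first <answer_tag>...</answer_tag> span; no tag means no reward.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    if match is None:
        return 0.0, False

    def normalize(text: str) -> str:
        # ASSUMPTION: keep only letters and digits, then lowercase.
        return "".join(ch for ch in text if ch.isalnum()).lower()

    is_correct = normalize(match.group(1)) == normalize(ground_truth)
    return (1.0 if is_correct else 0.0), is_correct
```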

nemo_rl.environments.rewards.bbox_giou_reward(ground_truth: str, response: str, giou_penalty_thres: float = 10.0, answer_tag: str = 'answer') → tuple[float, bool]

Given [x1, y1, x2, y2] normalized bounding box coordinates within the <{answer_tag}> tags, compute the generalized intersection over union (GIoU) between the ground-truth box and the box in the response.

The answer_tag is customizable and must be specified as part of the user chain-of-thought (CoT) prompt text file.
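GIoU extends IoU with a penalty based on the smallest box C that encloses both boxes A and B: GIoU = IoU - (|C| - |A ∪ B|) / |C|. It lies in (-1, 1] and stays informative even when the boxes do not overlap. The sketch below computes that quantity and parses a `[x1, y1, x2, y2]` list from the answer tag with `json.loads`. The helper names are hypothetical, the ground truth is assumed to use the same bracketed format, and the sketch deliberately leaves out how `bbox_giou_reward` maps GIoU to a reward and how `giou_penalty_thres` is used, since neither is documented here.

```python
import json
import re
from typing import Optional


def giou(box_a: list[float], box_b: list[float]) -> float:
    """Generalized IoU of two [x1, y1, x2, y2] boxes with x1 <= x2 and y1 <= y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle (zero width/height if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0 or enclose <= 0:
        return 0.0  # hypothetical fallback for degenerate (zero-area) boxes
    return inter / union - (enclose - union) / enclose


def parse_answer_box(response: str, answer_tag: str = "answer") -> Optional[list[float]]:
    """Return the [x1, y1, x2, y2] list inside the answer tags, or None."""
    tag = re.escape(answer_tag)
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        box = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not (isinstance(box, list) and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)):
        return None
    return [float(v) for v in box]
```

For example, `giou(parse_answer_box(response), json.loads(ground_truth))` scores a response against a ground-truth string such as `"[0.1, 0.2, 0.5, 0.6]"`.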

nemo_rl.environments.rewards.combine_reward_functions(reward_functions: list[tuple[Callable[[str, str], tuple[float, bool]], float]]) → Callable[[str, str], tuple[float, bool]]

Return a callable that takes (ground_truth, response), evaluates multiple reward functions in sequence, and combines their weighted rewards.

Each reward function is weighted by the second element of its tuple. The reward functions and their weights can be specified in the YAML config file and are resolved in the `VLMEnvironment` class.

Parameters: reward_functions (list[tuple[Callable[[str, str], tuple[float, bool]], float]]) – A list of (reward function, weight) tuples.
Returns: A callable that takes (ground_truth, response) and evaluates the reward functions in sequence.
Return type: Callable[[str, str], tuple[float, bool]]
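A minimal sketch of a weighted combination. It assumes the weights are normalized to sum to 1, the combined reward is the weighted sum of the individual rewards, and the combined `is_correct` is `True` only if every function that returns a non-`None` verdict agrees (format-style rewards return `None`). These aggregation rules are assumptions rather than the documented behavior of `combine_reward_functions`, and `combine_reward_functions_sketch` is a hypothetical name.

```python
from typing import Callable, Optional

# A reward function maps (ground_truth, response) to (reward, verdict or None).
RewardFn = Callable[[str, str], tuple[float, Optional[bool]]]


def combine_reward_functions_sketch(
    reward_functions: list[tuple[RewardFn, float]],
) -> Callable[[str, str], tuple[float, bool]]:
    total_weight = sum(weight for _, weight in reward_functions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    def combined(ground_truth: str, response: str) -> tuple[float, bool]:
        reward = 0.0
        verdicts: list[bool] = []
        for fn, weight in reward_functions:  # evaluated in list order
            r, ok = fn(ground_truth, response)
            reward += (weight / total_weight) * r  # ASSUMPTION: normalized weights
            if ok is not None:  # skip format-style rewards that give no verdict
                verdicts.append(ok)
        # ASSUMPTION: correct only if all verdicts agree (vacuously True if none).
        return reward, all(verdicts)

    return combined
```

With an exact-match reward at weight 3.0 and a format reward at weight 1.0, for instance, the exact-match signal carries 3 / (3 + 1) = 75% of the combined reward under this normalization.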
