nemo_rl.environments.rewards

Module Contents

Functions

math_expression_reward

Reward the agent when the answer within the <{tag}> tags is the same expression as the ground truth.

format_reward

Reward the agent when the response follows the format: (.*) <think> (.*) </think> <answer> (.*) </answer>.

exact_answer_alphanumeric_reward

Reward the agent when the answer within the <{answer_tag}> tags is the same as the ground truth (case-insensitive).

bbox_giou_reward

Given [x1, y1, x2, y2] normalized bounding box coordinates within the <{answer_tag}> tags, compute the GIoU between the ground truth and the response.

combine_reward_functions

Returns a callable that takes (ground_truth, response), applies multiple reward functions in sequence, and combines their results.

Data

API

nemo_rl.environments.rewards.math_verify_func

'math_metric(…)'

nemo_rl.environments.rewards.boxed

None

nemo_rl.environments.rewards.math_expression_reward(
    ground_truth: str,
    response: str,
    tag: str = 'answer',
) -> tuple[float, bool]

Reward the agent when the answer within the <{tag}> tags is the same expression as the ground truth.

The tag is customizable and must also be specified in the user chain-of-thought (CoT) prompt text file.
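To make the tag-extraction step concrete, here is a minimal standalone sketch. It is not the library's implementation: `extract_tag_content` and `sketch_math_expression_reward` are hypothetical names, and the equivalence check uses `fractions.Fraction` as a simple stand-in for real expression comparison (the module's `math_verify_func` presumably handles that part).

```python
import re
from fractions import Fraction
from typing import Optional


def extract_tag_content(response: str, tag: str = "answer") -> Optional[str]:
    """Return the text inside the last <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    matches = re.findall(pattern, response, re.DOTALL)
    return matches[-1].strip() if matches else None


def sketch_math_expression_reward(
    ground_truth: str, response: str, tag: str = "answer"
) -> tuple[float, bool]:
    """Hypothetical stand-in: reward 1.0 when the tagged answer equals the ground truth."""
    answer = extract_tag_content(response, tag)
    if answer is None:
        return 0.0, False
    truth = ground_truth.strip()
    try:
        # Assumption: compare as exact rationals, so "1/2" and "0.5" are equal.
        same = Fraction(answer) == Fraction(truth)
    except (ValueError, ZeroDivisionError):
        # Not a plain number: fall back to a literal string comparison.
        same = answer == truth
    return (1.0, True) if same else (0.0, False)
```

With these assumptions, a response ending in `<answer>1/2</answer>` scores 1.0 against a ground truth of `0.5`, and a response with no tag pair scores 0.0.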

nemo_rl.environments.rewards.format_reward(
    ground_truth: str,
    response: str,
    think_tag: str = 'think',
    answer_tag: str = 'answer',
) -> tuple[float, Optional[bool]]

Reward the agent when the response follows the format: (.*) <think> (.*) </think> <answer> (.*) </answer>.

The think_tag and answer_tag are customizable and must also be specified in the user chain-of-thought (CoT) prompt text file.
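A format check of this kind can be sketched with a single regular expression. The function below is a hypothetical illustration, not the library's code: the all-or-nothing score and the `None` in the second position are assumptions, the latter suggested by the `Optional[bool]` return type (a format check alone says nothing about whether the answer is correct).

```python
import re
from typing import Optional


def sketch_format_reward(
    ground_truth: str,
    response: str,
    think_tag: str = "think",
    answer_tag: str = "answer",
) -> tuple[float, Optional[bool]]:
    """Hypothetical stand-in: 1.0 if a think block is followed by an answer block."""
    think = re.escape(think_tag)
    answer = re.escape(answer_tag)
    # A <think>...</think> block, optional whitespace, then <answer>...</answer>.
    pattern = rf"<{think}>.*?</{think}>\s*<{answer}>.*?</{answer}>"
    matched = re.search(pattern, response, re.DOTALL) is not None
    # Assumption: the ground truth is unused and correctness is left undecided.
    return (1.0 if matched else 0.0), None
```

Because the ground truth is ignored, a reward like this is only useful alongside an answer-checking reward.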

nemo_rl.environments.rewards.exact_answer_alphanumeric_reward(
    ground_truth: str,
    response: str,
    answer_tag: str = 'answer',
) -> tuple[float, bool]

Reward the agent when the answer within the <{answer_tag}> tags is the same as the ground truth (case-insensitive).

The answer_tag is customizable and must also be specified in the user chain-of-thought (CoT) prompt text file.
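As a rough illustration of a case-insensitive alphanumeric comparison, the sketch below lowercases both strings and drops every character outside `a-z` and `0-9` before comparing. The helper names are hypothetical and the exact normalization the library applies is an assumption here.

```python
import re


def normalize_alphanumeric(text: str) -> str:
    """Lowercase the text and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def sketch_exact_answer_reward(
    ground_truth: str, response: str, answer_tag: str = "answer"
) -> tuple[float, bool]:
    """Hypothetical stand-in: reward 1.0 when the normalized answers match."""
    tag = re.escape(answer_tag)
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    if match is None:
        # No answer tags in the response: nothing to compare.
        return 0.0, False
    same = normalize_alphanumeric(match.group(1)) == normalize_alphanumeric(ground_truth)
    return (1.0, True) if same else (0.0, False)
```

Under this normalization, `<answer> paris. </answer>` matches a ground truth of `Paris`, since whitespace, punctuation, and case are all discarded.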

nemo_rl.environments.rewards.bbox_giou_reward(
    ground_truth: str,
    response: str,
    giou_penalty_thres: float = 10.0,
    answer_tag: str = 'answer',
) -> tuple[float, bool]

Given normalized bounding box coordinates [x1, y1, x2, y2] within the <{answer_tag}> tags, compute the generalized intersection over union (GIoU) between the ground truth box and the response box.

The answer_tag is customizable and must also be specified in the user chain-of-thought (CoT) prompt text file.
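For reference, GIoU for two axis-aligned boxes is IoU minus the fraction of the smallest enclosing box that is not covered by their union, so it ranges from -1 to 1. The sketch below implements that standard formula together with a hypothetical parser for a JSON-style list inside the answer tags. The names `parse_bbox` and `giou` are illustrative, and the role of `giou_penalty_thres` is not modeled because this page does not describe it.

```python
import json
import re
from typing import Optional, Sequence


def parse_bbox(response: str, answer_tag: str = "answer") -> Optional[list[float]]:
    """Extract [x1, y1, x2, y2] from inside the answer tags, or None on failure."""
    tag = re.escape(answer_tag)
    match = re.search(rf"<{tag}>\s*(\[.*?\])\s*</{tag}>", response, re.DOTALL)
    if match is None:
        return None
    try:
        box = json.loads(match.group(1))
        if not isinstance(box, list) or len(box) != 4:
            return None
        return [float(v) for v in box]
    except (ValueError, TypeError):
        return None


def giou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Generalized IoU of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0.0 or enclose <= 0.0:
        return 0.0  # Assumption: treat degenerate boxes as scoring zero.
    return inter / union - (enclose - union) / enclose
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: the farther apart they are, the closer the value gets to -1.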

nemo_rl.environments.rewards.combine_reward_functions(
    reward_functions: list[tuple[Callable[[str, str], tuple[float, bool]], float]],
) -> Callable[[str, str], tuple[float, bool]]

Returns a callable that takes (ground_truth, response), applies multiple reward functions in sequence, and combines their results.

Each reward function is weighted by the second element of its tuple. The functions and weights can be provided in the YAML config file and are resolved in the VLMEnvironment class.

Parameters:

reward_functions – list[tuple[Callable[[str, str], tuple[float, bool]], float]]. A list of (reward function, weight) pairs.

Returns:

A callable that takes (ground_truth, response), applies multiple reward functions in sequence, and combines their results.

Return type:

Callable[[str, str], tuple[float, bool]]
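One plausible way to combine weighted reward functions is sketched below. It is an illustration under stated assumptions, not the library's implementation: `sketch_combine` is a hypothetical name, the total is a plain weighted sum (the real function may normalize the weights), and the combined correctness flag is taken to be true only if every function that returns a non-`None` verdict reports `True`.

```python
from typing import Callable, Optional

RewardFn = Callable[[str, str], tuple[float, Optional[bool]]]


def sketch_combine(
    reward_functions: list[tuple[RewardFn, float]],
) -> Callable[[str, str], tuple[float, bool]]:
    """Hypothetical stand-in: build one reward function from weighted parts."""

    def combined(ground_truth: str, response: str) -> tuple[float, bool]:
        total = 0.0
        verdicts = []
        for reward_fn, weight in reward_functions:
            reward, is_correct = reward_fn(ground_truth, response)
            total += weight * reward  # Assumption: unnormalized weighted sum.
            if is_correct is not None:
                # Skip functions that do not judge correctness (e.g. format checks).
                verdicts.append(is_correct)
        return total, all(verdicts)

    return combined


# Toy reward functions, for illustration only.
def toy_exact(ground_truth: str, response: str) -> tuple[float, Optional[bool]]:
    same = ground_truth == response
    return (1.0, True) if same else (0.0, False)


def toy_format(ground_truth: str, response: str) -> tuple[float, Optional[bool]]:
    return 0.5, None


combined_reward = sketch_combine([(toy_exact, 0.5), (toy_format, 0.5)])
```

Calling `combined_reward("a", "a")` under these assumptions yields a weighted total of 0.75 with a `True` correctness flag, since only `toy_exact` contributes a verdict.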