nemo_rl.environments.math_environment#

Module Contents#

Classes#

Functions#

API#

class nemo_rl.environments.math_environment.MathEnvConfig#

Bases: typing.TypedDict

num_workers: int#

None

stop_strings: Optional[list[str]]#

None

verifier_type: Optional[str]#

None

nemo_rl.environments.math_environment._mute_output()#
class nemo_rl.environments.math_environment.HFVerifyWorker#

Initialization

verify(
pred_responses: list[str],
ground_truths: list[str],
return_extracted_answer: bool = False,
) Union[list[float], tuple[list[float], list[str | None]]]#

Verify the correctness of the predicted responses against the ground truth.

Parameters:
  • pred_responses โ€“ list[str]. The predicted responses from the LLM.

  • ground_truths โ€“ list[str]. The ground truth responses.

Returns:

Union[list[float], tuple[list[float], list[str | None]]]. If return_extracted_answer is False, returns only the scores. If return_extracted_answer is True, returns (scores, extracted_answers).

class nemo_rl.environments.math_environment.MultilingualMultichoiceVerifyWorker#
verify(
pred_responses: list[str],
ground_truths: list[str],
return_extracted_answer: bool = False,
) Union[list[float], tuple[list[float], list[str | None]]]#

Verify the correctness of the predicted responses against the ground truth.

Parameters:
  • pred_responses โ€“ list[str]. The predicted responses from the LLM.

  • ground_truths โ€“ list[str]. The ground truth responses.

Returns:

Union[list[float], tuple[list[float], list[str | None]]]. If return_extracted_answer is False, returns only the scores. If return_extracted_answer is True, returns (scores, extracted_answers).

class nemo_rl.environments.math_environment.EnglishMultichoiceVerifyWorker#
verify(
pred_responses: list[str],
ground_truths: list[str],
return_extracted_answer: bool = False,
) Union[list[float], tuple[list[float], list[str | None]]]#

Verify the correctness of the predicted responses against the ground truth.

Parameters:
  • pred_responses โ€“ list[str]. The predicted responses from the LLM.

  • ground_truths โ€“ list[str]. The ground truth responses.

Returns:

Union[list[float], tuple[list[float], list[str | None]]]. If return_extracted_answer is False, returns only the scores. If return_extracted_answer is True, returns (scores, extracted_answers).

class nemo_rl.environments.math_environment.MathEnvironmentMetadata#

Bases: typing.TypedDict

ground_truth: str#

None

extracted_answer: str | None#

None

class nemo_rl.environments.math_environment.MathEnvironment(
cfg: nemo_rl.environments.math_environment.MathEnvConfig,
)#

Bases: nemo_rl.environments.interfaces.EnvironmentInterface[nemo_rl.environments.math_environment.MathEnvironmentMetadata]

shutdown() None#
step(
message_log_batch: list[nemo_rl.data.interfaces.LLMMessageLogType],
metadata: list[nemo_rl.environments.math_environment.MathEnvironmentMetadata],
return_extracted_answer: bool = False,
) nemo_rl.environments.interfaces.EnvironmentReturn[nemo_rl.environments.math_environment.MathEnvironmentMetadata]#

Runs a step in the math environment.

Parameters:
  • message_log โ€“ list[list[dict[str, str]]]. A batch of OpenAI-API-like message logs that represent interactions with the LLM.

  • metadata โ€“ list[MathEnvironmentMetadata]. The grader will use the โ€˜ground_truthโ€™ key to evaluate correctness. The extracted answer will be stored to caculate cons@k.

Returns:

A tuple containing: - list[dict[str, str]]: Observations/responses batch - list[dict]: Updated metadata - list[str]: Next stop strings for the next turn - Tensor: Rewards tensor - Tensor: Done flags tensor

Return type:

EnvironmentReturn

global_post_process_and_metrics(
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[Any],
) tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[Any], dict[str, float | int]]#

Computes metrics for this environment given a global rollout batch.

Every rank will run this function, so youโ€™re free to use distributed calculations if youโ€™d prefer for heavy metrics.