nemo_rl.environments.math_environment#

Module Contents#

Classes#

Functions#

API#

class nemo_rl.environments.math_environment.MathEnvConfig[source]#

Bases: typing.TypedDict

num_workers: int#

None

stop_strings: Optional[List[str]]#

None

nemo_rl.environments.math_environment._mute_output()[source]#
class nemo_rl.environments.math_environment.HFVerifyWorker[source]#

Initialization

verify(
pred_responses: List[str],
ground_truths: List[str],
) List[float][source]#

Verify the correctness of the predicted responses against the ground truth.

Parameters:
  • pred_responses – List[str]. The predicted responses from the LLM.

  • ground_truths – List[str]. The ground truth responses.

Returns:

List[float]. The rewards for each predicted response.

class nemo_rl.environments.math_environment.MathEnvironmentMetadata[source]#

Bases: typing.TypedDict

ground_truth: str#

None

class nemo_rl.environments.math_environment.MathEnvironment(
cfg: nemo_rl.environments.math_environment.MathEnvConfig,
)[source]#

Bases: nemo_rl.environments.interfaces.EnvironmentInterface

shutdown()[source]#
step(
message_log_batch: List[List[Dict[str, str]]],
metadata: List[nemo_rl.environments.math_environment.MathEnvironmentMetadata],
) nemo_rl.environments.interfaces.EnvironmentReturn[source]#

Runs a step in the math environment.

Parameters:
  • message_log – List[List[Dict[str, str]]]. A batch of OpenAI-API-like message logs that represent interactions with the LLM.

  • metadata – List[MathEnvironmentMetadata]. The grader will use the ‘ground_truth’ key to evaluate correctness.

Returns:

A tuple containing: - List[Dict[str, str]]: Observations/responses batch - List[Dict]: Updated metadata - List[str]: Next stop strings for the next turn - Tensor: Rewards tensor - Tensor: Done flags tensor

Return type:

EnvironmentReturn

global_post_process_and_metrics(
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict,
) Tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict, dict][source]#

Computes metrics for this environment given a global rollout batch.

Every rank will run this function, so you’re free to use distributed calculations if you’d prefer for heavy metrics.