nemo_rl.environments.code_jaccard_environment#

Module Contents#

Classes#

CodeJaccardEnvConfig

CodeJaccardEnvironmentMetadata

CodeJaccardVerifyWorker

Worker for evaluating code responses using Jaccard-based similarity.

CodeJaccardEnvironment

Environment for evaluating code responses using Jaccard similarity.

API#

class nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvConfig#

Bases: typing.TypedDict

num_workers: int#

None

stop_strings: Optional[list[str]]#

None
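A minimal configuration sketch, assuming the TypedDict is filled as a plain dict; the field values below are illustrative, not defaults:

```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvConfig

# Illustrative values: num_workers sizes the pool of verify workers,
# stop_strings may be None or a list of stop strings.
config: CodeJaccardEnvConfig = {
    "num_workers": 4,
    "stop_strings": None,
}
```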

class nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata#

Bases: typing.TypedDict

ground_truth: str#

None

class nemo_rl.environments.code_jaccard_environment.CodeJaccardVerifyWorker#

Worker for evaluating code responses using Jaccard-based similarity.

Initialization

verify(
pred_responses: list[str],
ground_truths: list[str],
return_extracted_answer: bool = False,
) → Union[list[float], tuple[list[float], list[str | None]]]#

Verify code responses against ground-truth solutions using Jaccard-based similarity.

We use a simple text similarity approach (Jaccard over tokenized words) to evaluate how well the model’s response aligns with the ground truth.

Parameters:
  • pred_responses – list[str]. The predicted responses from the LLM.

  • ground_truths – list[str]. The ground-truth solutions.

  • return_extracted_answer – bool. Whether to return extracted answers (here, the full response).

Returns:

Union[list[float], tuple[list[float], list[str | None]]]. If return_extracted_answer is False, returns only the scores. If return_extracted_answer is True, returns (scores, extracted_answers).
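A usage sketch of the verify contract. It assumes the worker can be instantiated with no arguments and called directly for illustration; inside CodeJaccardEnvironment it is managed as part of a worker pool, so the real call path may differ.

```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardVerifyWorker

# Assumption: direct construction works here; the environment normally manages workers.
worker = CodeJaccardVerifyWorker()

pred_responses = ["def add(a, b):\n    return a + b"]
ground_truths = ["def add(x, y):\n    return x + y"]

scores = worker.verify(pred_responses, ground_truths)
# scores: one Jaccard-based float per (prediction, ground truth) pair

scores, extracted = worker.verify(
    pred_responses, ground_truths, return_extracted_answer=True
)
# extracted: list[str | None]; here each entry is the full predicted response
```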

_calculate_preference_score(response: str, ground_truth: str) → float#

Calculate a Jaccard-based alignment score between response and ground truth.

This is a simplified scoring function. In practice, you might want to use:

  • Semantic similarity models

  • BLEU/ROUGE scores

Algorithm (a minimal Python sketch follows this method's documentation):

  • Tokenize both texts into sets A and B (here we use whitespace tokenization).

  • Compute the intersection size |A ∩ B| and the union size |A ∪ B|.

  • J(A, B) = |A ∩ B| / |A ∪ B|, returning 0.0 when the union is empty.

  • Optionally combine with a length-ratio penalty to discourage degenerate very short or very long matches.

Complexity:

  • Tokenization: O(n + m)

  • Set ops: O(n + m) average (hash sets)

Parameters:

  • response – The model’s response.

  • ground_truth – The ground-truth solution.

Returns:

Score between 0.0 and 1.0

Return type:

float
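The scoring steps above can be illustrated with a small self-contained sketch. This is not the module's exact implementation; the optional length-ratio penalty and its blending weight are assumptions for illustration.

```python
def jaccard_score(
    response: str, ground_truth: str, length_penalty_weight: float = 0.0
) -> float:
    """Jaccard similarity over whitespace tokens, optionally blended with a length-ratio penalty."""
    tokens_a = set(response.split())
    tokens_b = set(ground_truth.split())

    union = tokens_a | tokens_b
    if not union:  # both strings are empty -> guard against division by zero
        return 0.0
    score = len(tokens_a & tokens_b) / len(union)

    if length_penalty_weight > 0.0:
        # Ratio of shorter to longer token count: 1.0 when lengths match, smaller otherwise.
        len_a, len_b = len(response.split()), len(ground_truth.split())
        length_ratio = min(len_a, len_b) / max(len_a, len_b) if max(len_a, len_b) else 0.0
        score = (1.0 - length_penalty_weight) * score + length_penalty_weight * length_ratio

    return score


# Example: {"return", "+"} shared out of 6 distinct tokens -> 2 / 6 ≈ 0.33
print(jaccard_score("return a + b", "return x + y"))
```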

class nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironment(
cfg: nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvConfig,
)#

Bases: nemo_rl.environments.interfaces.EnvironmentInterface[nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata]

Environment for evaluating code responses using Jaccard similarity.

Initialization

shutdown() → None#

Shutdown all workers.

step(
message_log_batch: list[nemo_rl.data.interfaces.LLMMessageLogType],
metadata: list[nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata],
return_extracted_answer: bool = False,
) → nemo_rl.environments.interfaces.EnvironmentReturn[nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata]#

Runs a step in the CodeJaccard environment.

Parameters:
  • message_log_batch – Batch of OpenAI-API-like message logs.

  • metadata – Batch of CodeJaccardEnvironmentMetadata with ground truth.

  • return_extracted_answer – Whether to return extracted answers.

Returns:

Tuple containing observations, metadata, stop strings, rewards, and done flags.

Return type:

EnvironmentReturn
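A sketch of the input shapes step expects, assuming OpenAI-style message logs (lists of role/content dicts) and that the environment can be constructed and called directly; in the full NeMo RL stack it may instead run as a remote worker, so the call path here is illustrative.

```python
from nemo_rl.environments.code_jaccard_environment import (
    CodeJaccardEnvironment,
    CodeJaccardEnvironmentMetadata,
)

env = CodeJaccardEnvironment(cfg={"num_workers": 2, "stop_strings": None})

# One message log per sample; the last assistant message holds the model's code response.
message_log_batch = [
    [
        {"role": "user", "content": "Write a function that adds two numbers."},
        {"role": "assistant", "content": "def add(a, b):\n    return a + b"},
    ]
]
metadata: list[CodeJaccardEnvironmentMetadata] = [
    {"ground_truth": "def add(x, y):\n    return x + y"}
]

result = env.step(message_log_batch, metadata)
# result is an EnvironmentReturn carrying observations, metadata, stop strings,
# rewards (the Jaccard-based scores), and done flags.

env.shutdown()  # release the verify workers when finished
```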

global_post_process_and_metrics(
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[Any],
) → tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[Any], dict[str, float | int]]#

Post-process batch and compute metrics for CodeJaccard.