nemo_rl.environments.code_jaccard_environment#
Module Contents#
Classes#
- CodeJaccardVerifyWorker: Worker for evaluating code responses using Jaccard-based similarity.
- CodeJaccardEnvironment: Environment for evaluating code responses using Jaccard similarity.
API#
- class nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvConfig#
Bases:
typing.TypedDict
- num_workers: int#
None
- stop_strings: Optional[list[str]]#
None
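As a quick illustration, a config for this environment is a plain dict matching the TypedDict fields; the values below are hypothetical examples, not library defaults.

```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvConfig

# Hypothetical values; CodeJaccardEnvConfig is a TypedDict, so a plain dict works.
config: CodeJaccardEnvConfig = {
    "num_workers": 2,
    "stop_strings": None,  # or e.g. ["</code>"] to stop generation early
}
```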
- class nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata#
Bases:
typing.TypedDict
- ground_truth: str#
None
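Correspondingly, each sample's metadata carries the reference solution. A minimal, hypothetical example:

```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironmentMetadata

# One metadata entry per sample in the batch; ground_truth holds the reference solution text.
metadata: CodeJaccardEnvironmentMetadata = {
    "ground_truth": "def add(a, b):\n    return a + b",
}
```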
- class nemo_rl.environments.code_jaccard_environment.CodeJaccardVerifyWorker#
Worker for evaluating code responses using Jaccard-based similarity.
Initialization
- verify(
- pred_responses: list[str],
- ground_truths: list[str],
- return_extracted_answer: bool = False,
- ) → Union[list[float], tuple[list[float], list[str | None]]]#
Verify code responses against ground-truth solutions using Jaccard-based similarity.
We use a simple text similarity approach (Jaccard over tokenized words) to evaluate how well the model’s response aligns with the ground truth.
- Parameters:
pred_responses – list[str]. The predicted responses from the LLM.
ground_truths – list[str]. The ground-truth solutions.
return_extracted_answer – bool. Whether to return extracted answers (here, the full response).
- Returns:
Union[list[float], tuple[list[float], list[str | None]]]. If return_extracted_answer is False, returns only the scores. If return_extracted_answer is True, returns (scores, extracted_answers).
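A minimal sketch of the expected inputs and outputs, assuming the worker can be instantiated and called directly (constructor arguments, if any, are omitted as an assumption; in the actual library the worker may be managed as a remote worker by CodeJaccardEnvironment, in which case the environment's step is the intended entry point):

```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardVerifyWorker

worker = CodeJaccardVerifyWorker()

pred_responses = ["def add(a, b):\n    return a + b"]
ground_truths = ["def add(x, y):\n    return x + y"]

# One score per (prediction, ground truth) pair.
scores = worker.verify(pred_responses, ground_truths)

# With return_extracted_answer=True, the extracted answers (here, the full
# responses) are returned alongside the scores.
scores, extracted = worker.verify(
    pred_responses, ground_truths, return_extracted_answer=True
)
```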
- _calculate_preference_score(response: str, ground_truth: str) → float#
Calculate a Jaccard-based alignment score between response and ground truth.
This is a simplified scoring function. In practice, you might want to use:
- Semantic similarity models
- BLEU/ROUGE scores
The score is computed as follows (a sketch appears after this method's documentation):
- Tokenize both texts into sets A and B (here we use whitespace tokenization).
- Compute the intersection size |A ∩ B| and the union size |A ∪ B|.
- J(A, B) = |A ∩ B| / |A ∪ B|, returning 0.0 when the union is empty.
- Optionally combine with a length-ratio penalty to discourage degenerate very short or very long matches.
Complexity:
- Tokenization: O(n + m)
- Set ops: O(n + m) average (hash sets)
- Parameters:
response – The model's response.
ground_truth – The ground-truth solution to compare against.
- Returns:
Score between 0.0 and 1.0
- Return type:
float
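To make the scoring concrete, here is a minimal sketch of a Jaccard score with an optional length-ratio penalty, following the steps listed above; the function name and the exact way the penalty is combined are illustrative assumptions, not the library's implementation.

```python
def jaccard_preference_score(response: str, ground_truth: str) -> float:
    # Hypothetical re-implementation of the steps described above,
    # not the exact code used by CodeJaccardVerifyWorker.
    tokens_a = set(response.split())       # whitespace tokenization, O(n)
    tokens_b = set(ground_truth.split())   # whitespace tokenization, O(m)

    union = tokens_a | tokens_b
    if not union:                          # guard: both texts empty -> 0.0
        return 0.0
    jaccard = len(tokens_a & tokens_b) / len(union)

    # Optional length-ratio penalty: discourage responses that are much
    # shorter or much longer than the ground truth.
    len_a, len_b = len(response.split()), len(ground_truth.split())
    length_ratio = min(len_a, len_b) / max(len_a, len_b) if max(len_a, len_b) else 0.0

    return jaccard * length_ratio
```

Identical texts score 1.0, disjoint texts score 0.0, and partially overlapping texts fall in between.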
- class nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironment( )#
Bases:
nemo_rl.environments.interfaces.EnvironmentInterface[nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata]
Environment for evaluating code responses using Jaccard similarity.
Initialization
- shutdown() → None#
Shutdown all workers.
- step(
- message_log_batch: list[nemo_rl.data.interfaces.LLMMessageLogType],
- metadata: list[nemo_rl.environments.code_jaccard_environment.CodeJaccardEnvironmentMetadata],
- return_extracted_answer: bool = False,
- )#
Runs a step in the CodeJaccard environment.
- Parameters:
message_log_batch – Batch of OpenAI-API-like message logs.
metadata – Batch of CodeJaccardEnvironmentMetadata with ground truth.
return_extracted_answer – Whether to return extracted answers.
- Returns:
Tuple containing observations, metadata, stop strings, rewards, and done flags.
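As an illustration, a step call pairs one message log with one metadata entry per sample; the message contents below are hypothetical.

```python
# Hypothetical batch of one sample; each message log follows the OpenAI-style
# role/content format described above.
message_log_batch = [
    [
        {"role": "user", "content": "Write a function that adds two numbers."},
        {"role": "assistant", "content": "def add(a, b):\n    return a + b"},
    ]
]
metadata = [{"ground_truth": "def add(x, y):\n    return x + y"}]

# env is an already-constructed CodeJaccardEnvironment instance.
result = env.step(message_log_batch, metadata)
```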
- global_post_process_and_metrics( ) → tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[Any], dict[str, float | int]]#
Post-process batch and compute metrics for CodeJaccard.