nemo_rl.evals.eval#
Module Contents#
Classes#

| EvalConfig |
| MasterConfig |

Functions#

| setup | Set up components for model evaluation. |
| eval_pass_k | Evaluate pass@k score using an unbiased estimator. |
| eval_cons_k | Evaluate cons@k score using an unbiased estimator. |
| run_env_eval | Main entry point for running evaluation using environment. |
| _run_env_eval_impl | Unified implementation for both sync and async evaluation. |
| _generate_texts | Generate texts using either sync or async method. |
| _save_evaluation_data_to_json | Save evaluation data to a JSON file. |
| _print_results | Print evaluation results. |
API#
- class nemo_rl.evals.eval.EvalConfig#
Bases:
typing.TypedDict
- metric: str#
None
- num_tests_per_prompt: int#
None
- seed: int#
None
- k_value: int#
None
- save_path: str | None#
None
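For orientation, an EvalConfig can be instantiated as a plain dict. The values below are hypothetical; only the keys come from the TypedDict definition above.

```python
from nemo_rl.evals.eval import EvalConfig

# Hypothetical values; only the keys are defined by the TypedDict above.
eval_config: EvalConfig = {
    "metric": "pass@k",         # metric to report (value shown is an assumed example)
    "num_tests_per_prompt": 8,  # samples drawn per prompt
    "seed": 42,                 # RNG seed for reproducibility
    "k_value": 4,               # the k in pass@k / cons@k
    "save_path": None,          # set a path string to save per-sample results
}
```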
- class nemo_rl.evals.eval.MasterConfig#
Bases:
typing.TypedDict
- eval: nemo_rl.evals.eval.EvalConfig#
None
- generation: nemo_rl.models.generation.interfaces.GenerationConfig#
None
- tokenizer: nemo_rl.models.policy.TokenizerConfig#
None
- data: nemo_rl.data.MathDataConfig#
None
- cluster: nemo_rl.distributed.virtual_cluster.ClusterConfig#
None
- nemo_rl.evals.eval.setup(
- master_config: nemo_rl.evals.eval.MasterConfig,
- tokenizer: transformers.AutoTokenizer,
- dataset: nemo_rl.data.datasets.AllTaskProcessedDataset,
)#
Set up components for model evaluation.
Initializes the VLLM model and data loader.
- Parameters:
master_config – Configuration settings.
tokenizer – Tokenizer for the model.
dataset – Dataset to evaluate on.
- Returns:
VLLM model, data loader, and config.
- nemo_rl.evals.eval.eval_pass_k(
- rewards: torch.Tensor,
- num_tests_per_prompt: int,
- k: int,
)#
Evaluate pass@k score using an unbiased estimator.
Reference: https://github.com/huggingface/evaluate/blob/32546aafec25cdc2a5d7dd9f941fc5be56ba122f/metrics/code_eval/code_eval.py#L198-L213
- Parameters:
rewards – Tensor of shape (batch_size * num_tests_per_prompt)
num_tests_per_prompt – int (number of samples per prompt)
k – int (pass@k value)
- Returns:
pass_k_score
- Return type:
float
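The linked reference computes the unbiased estimator pass@k = 1 − C(n−c, k) / C(n, k) for n samples with c successes, expressed as a numerically stable product. The sketch below reproduces that per-prompt estimator; reshaping the flat rewards tensor into (num_prompts, num_tests_per_prompt) groups is an assumption about how eval_pass_k aggregates.

```python
import numpy as np
import torch

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator 1 - C(n-c, k) / C(n, k), written as a stable product
    # (same form as the referenced Hugging Face code_eval implementation).
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical rewards for 2 prompts x 4 samples each (1.0 = pass, 0.0 = fail).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
num_tests_per_prompt, k = 4, 2
per_prompt = rewards.view(-1, num_tests_per_prompt)
scores = [pass_at_k(num_tests_per_prompt, int(row.sum()), k) for row in per_prompt]
print(sum(scores) / len(scores))  # mean pass@2 across prompts ~= 0.417
```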
- nemo_rl.evals.eval.eval_cons_k(
- rewards: torch.Tensor,
- num_tests_per_prompt: int,
- k: int,
- extracted_answers: list[str | None],
)#
Evaluate cons@k score using an unbiased estimator.
- Parameters:
rewards – Tensor of shape (batch_size * num_tests_per_prompt)
num_tests_per_prompt – int
k – int
extracted_answers – list[str | None]
- Returns:
cons_k_score
- Return type:
float
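cons@k (consensus@k) is commonly defined by majority-voting the extracted answers for a prompt and scoring the voted answer. The sketch below shows only that naive single-prompt voting step under this assumed definition; it is not the unbiased subset estimator the function itself implements.

```python
from collections import Counter

def majority_vote_score(answers: list[str | None], rewards: list[float]) -> float:
    # Naive consensus for one prompt: the most common extracted answer wins;
    # it scores 1.0 iff any sample carrying that answer was rewarded.
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return 0.0
    voted, _ = counts.most_common(1)[0]
    return max(r for a, r in zip(answers, rewards) if a == voted)

# Hypothetical samples for one prompt (num_tests_per_prompt = 4):
answers = ["42", "41", "42", None]
rewards = [1.0, 0.0, 1.0, 0.0]
print(majority_vote_score(answers, rewards))  # 1.0: majority answer "42" is correct
```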
- nemo_rl.evals.eval.run_env_eval(vllm_generation, dataloader, env, master_config)#
Main entry point for running evaluation using environment.
Generates model responses and evaluates them by env.
- Parameters:
vllm_generation – Model for generating responses.
dataloader – Data loader with evaluation samples.
env – Environment that scores responses.
master_config – Configuration settings.
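Taken together, the intended flow appears to be setup followed by run_env_eval. The sketch below shows only that wiring; constructing master_config, dataset, and env is project-specific and left as placeholders, and the tokenizer name is a hypothetical example.

```python
from transformers import AutoTokenizer

from nemo_rl.evals.eval import run_env_eval, setup

def evaluate(master_config, dataset, env):
    # master_config, dataset, and env are assumed to be built elsewhere
    # (MasterConfig, AllTaskProcessedDataset, and a scoring environment).
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")  # hypothetical
    vllm_generation, dataloader, master_config = setup(master_config, tokenizer, dataset)
    run_env_eval(vllm_generation, dataloader, env, master_config)
```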
- async nemo_rl.evals.eval._run_env_eval_impl(
- vllm_generation,
- dataloader,
- env,
- master_config,
- use_async=False,
)#
Unified implementation for both sync and async evaluation.
- async nemo_rl.evals.eval._generate_texts(vllm_generation, inputs, use_async)#
Generate texts using either sync or async method.
- nemo_rl.evals.eval._save_evaluation_data_to_json(
- evaluation_data,
- master_config,
- save_path,
)#
Save evaluation data to a JSON file.
- Parameters:
evaluation_data – List of evaluation samples
master_config – Configuration dictionary
save_path – Path to save evaluation results. Set to null to disable saving. Example: "results/eval_output" or "/path/to/evaluation_results"
- nemo_rl.evals.eval._print_results(
- master_config,
- generation_config,
- score,
- dataset_size,
- metric,
- k_value,
- num_tests_per_prompt,
)#
Print evaluation results.