nemo_evaluator.api
The central point of evaluation is the evaluate() function, which takes standardized input and returns standardized output. See nemo_evaluator.api.api_dataclasses to learn how to instantiate the standardized input and consume the standardized output. Below is an example of how one might configure and run an evaluation via the Python API:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams,
    ApiEndpoint,
)

# Create evaluation configuration
eval_config = EvaluationConfig(
    type="simple_evals.mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        temperature=0.1,
    ),
)

# Create target configuration
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type="chat",
        api_key="MY_API_KEY",  # Name of the environment variable that stores the API key
    ),
)

# Run evaluation
result = evaluate(eval_config, target_config)
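Because api_key names an environment variable rather than the key itself, that variable must be set before evaluate() is called. A minimal sketch, with a placeholder value:

import os

# Placeholder only; MY_API_KEY must hold a real API key for the target endpoint
os.environ["MY_API_KEY"] = "<your API key>"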
- nemo_evaluator.api.evaluate(eval_cfg: EvaluationConfig, target_cfg: EvaluationTarget, metadata: EvaluationMetadata | None = None)
Run an evaluation using configuration objects.
- Parameters:
eval_cfg – Evaluation configuration object containing output directory, parameters, and evaluation type
target_cfg – Target configuration object containing API endpoint details and adapter configuration
- Returns:
Evaluation results and metadata
- Return type:
- nemo_evaluator.api.show_available_tasks() → None
Prints all available evaluations in the following format:
{harness1}:
 * benchmark A
 * benchmark B
{harness2}:
 * benchmark A
 * benchmark B
...
Important
Only evaluations from installed wheels are displayed.
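A minimal usage sketch, assuming the function is importable directly from nemo_evaluator.api as the qualified name above suggests:

from nemo_evaluator.api import show_available_tasks

# Prints the installed harnesses and their benchmarks to stdout
show_available_tasks()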
- nemo_evaluator.api.get_available_evaluations() → tuple[dict[str, dict[str, Evaluation]], dict[str, Evaluation], dict]
Returns all available evaluations as Evaluation objects.
Important
Only evaluations from installed wheels are returned.
- Returns:
Tuple with the following elements:
1. Mapping: harness name -> tasks (dict)
2. Mapping: harness name -> default configs (for non-exposed tasks); the returned Evaluation should serve as a blueprint
3. Mapping: task name -> list of Evaluations
- Return type:
tuple[dict[str, dict[str, Evaluation]], dict[str, Evaluation], dict]
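A short sketch of consuming the returned tuple, assuming the function is importable from nemo_evaluator.api:

from nemo_evaluator.api import get_available_evaluations

harness_to_tasks, harness_defaults, task_to_evaluations = get_available_evaluations()

# List every benchmark exposed by each installed harness
for harness, tasks in harness_to_tasks.items():
    print(harness, "->", sorted(tasks))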
- nemo_evaluator.api.check_endpoint(endpoint_url: str, endpoint_type: Literal['completions', 'chat'], model_name: str, max_retries: int = 600, retry_interval: int = 2) → bool
Checks if the OpenAI-compatible endpoint is alive by sending a simple prompt.
- Parameters:
endpoint_url (str) – Full endpoint URL. For most servers this means either /v1/chat/completions or /completions must be provided.
endpoint_type (Literal["completions", "chat"]) – Indicates whether the model is instruction-tuned ("chat") or a base model ("completions"). Used to construct a proper payload structure.
model_name (str) – Model name included in the payload. Might be required by some endpoints.
max_retries (int, optional) – How many attempts to make before returning False. Defaults to 600.
retry_interval (int, optional) – How many seconds to wait between attempts. Defaults to 2.
- Raises:
ValueError – if endpoint_type is not one of "completions" or "chat"
- Returns:
whether the endpoint is alive
- Return type:
bool
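A usage sketch, assuming check_endpoint is importable from nemo_evaluator.api; the endpoint URL and model name are taken from the example above, and the reduced retry budget is only for illustration:

from nemo_evaluator.api import check_endpoint

is_alive = check_endpoint(
    endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
    endpoint_type="chat",
    model_name="meta/llama-3.1-8b-instruct",
    max_retries=5,      # illustrative; the default is 600
    retry_interval=2,
)
if not is_alive:
    raise RuntimeError("Endpoint did not become available in time")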