nemo_evaluator.api

The central point of evaluation is the evaluate() function, which takes standardized input and returns standardized output. See nemo_evaluator.api.api_dataclasses to learn how to instantiate the standardized input and consume the standardized output. Below is an example of how to configure and run an evaluation via the Python API:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams,
    ApiEndpoint
)

# Create evaluation configuration
eval_config = EvaluationConfig(
    type="simple_evals.mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        temperature=0.1
    )
)

# Create target configuration
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.NVIDIA.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type="chat",
        api_key="MY_API_KEY" # Name of the environment variable that stores api_key
    )
)

# Run evaluation
result = evaluate(eval_config, target_config)

nemo_evaluator.api.evaluate(
    eval_cfg: EvaluationConfig,
    target_cfg: EvaluationTarget,
    metadata: EvaluationMetadata | None = None,
) → EvaluationResult

Run an evaluation using configuration objects.

Parameters:
  • eval_cfg – Evaluation configuration object containing output directory, parameters, and evaluation type

  • target_cfg – Target configuration object containing API endpoint details and adapter configuration

Returns:

Evaluation results and metadata

Return type:

EvaluationResult
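
A minimal sketch of consuming the return value (continuing the example above; the exact fields of EvaluationResult are defined in nemo_evaluator.api.api_dataclasses):

# Continuing the example above: evaluate() returns an EvaluationResult
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

# Inspect or serialize the standardized result object as needed
print(result)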

nemo_evaluator.api.show_available_tasks() → None

Prints all available evaluations in the following format:

{harness1}:
    * benchmark A
    * benchmark B
{harness2}:
    * benchmark A
    * benchmark B
...

Important

Only evaluations from installed wheels are displayed.
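
For example, to print the installed harnesses and their benchmarks (a minimal sketch; the output depends on which evaluation wheels are installed):

from nemo_evaluator.api import show_available_tasks

# Prints every installed harness and its benchmarks to stdout
show_available_tasks()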

nemo_evaluator.api.get_available_evaluations() → tuple[dict[str, dict[str, Evaluation]], dict[str, Evaluation], dict]

Returns all available evaluations as Evaluation objects.

Important

Only evaluations from installed wheels are returned.

Returns:

Tuple with the following elements:

  1. Mapping: harness name -> tasks (dict)

  2. Mapping: harness name -> default configs (for non-exposed tasks); the returned Evaluation should serve as a blueprint

  3. Mapping: task name -> list of Evaluations

Return type:

tuple[dict[str, dict[str, Evaluation]], dict[str, Evaluation], dict]
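
A minimal sketch of unpacking the returned tuple (the variable names below are illustrative, not part of the API):

from nemo_evaluator.api import get_available_evaluations

# Unpack the three mappings described above (names are illustrative)
tasks_by_harness, defaults_by_harness, evaluations_by_task = get_available_evaluations()

# For example, list each installed harness and its exposed tasks
for harness, tasks in tasks_by_harness.items():
    print(harness, "->", sorted(tasks.keys()))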

nemo_evaluator.api.check_endpoint(
    endpoint_url: str,
    endpoint_type: Literal['completions', 'chat'],
    model_name: str,
    max_retries: int = 600,
    retry_interval: int = 2,
) → bool

Checks if the OpenAI-compatible endpoint is alive by sending a simple prompt.

Parameters:
  • endpoint_url (str) – Full endpoint URL. For most servers, this means either /v1/chat/completions or /completions must be provided

  • endpoint_type (Literal['completions', 'chat']) – Indicates whether the model is instruction-tuned ("chat") or a base model ("completions"). Used to construct the proper payload structure.

  • model_name (str) – Model name to include in the payload. Might be required by some endpoints.

  • max_retries (int, optional) – How many attempts to make before returning False. Defaults to 600.

  • retry_interval (int, optional) – How many seconds to wait between attempts. Defaults to 2.

Raises:

ValueError – if endpoint_type is not one of "completions" or "chat"

Returns:

Whether the endpoint is alive

Return type:

bool
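
A minimal sketch of probing a chat endpoint before launching an evaluation (the URL and model name below are only examples, and max_retries is lowered from its default of 600 for illustration):

from nemo_evaluator.api import check_endpoint

is_alive = check_endpoint(
    endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
    endpoint_type="chat",
    model_name="meta/llama-3.1-8b-instruct",
    max_retries=10,     # illustrative override; the default is 600
    retry_interval=2,
)
if not is_alive:
    raise RuntimeError("Endpoint did not respond in time")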