nemo_evaluator.api

The central point of evaluation is the evaluate() function, which takes standardized input and returns standardized output. See nemo_evaluator.api.api_dataclasses to learn how to instantiate the standardized input and consume the standardized output. Below is an example of how to configure and run an evaluation via the Python API:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams,
    ApiEndpoint
)

# Create evaluation configuration
eval_config = EvaluationConfig(
    type="simple_evals.mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        temperature=0.1
    )
)

# Create target configuration
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.NVIDIA.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type="chat",
        api_key="MY_API_KEY" # Name of the environment variable that stores api_key
    )
)

# Run evaluation
result = evaluate(eval_config, target_config)

nemo_evaluator.api.evaluate(
    eval_cfg: EvaluationConfig,
    target_cfg: EvaluationTarget,
    metadata: EvaluationMetadata | None = None,
) → EvaluationResult

Run an evaluation using configuration objects.

Parameters:
  • eval_cfg – Evaluation configuration object containing output directory, parameters, and evaluation type

  • target_cfg – Target configuration object containing API endpoint details and adapter configuration

Returns:

Evaluation results and metadata

Return type:

EvaluationResult
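
A minimal sketch of consuming the return value (continuing the example above; the exact fields of EvaluationResult are defined in nemo_evaluator.api.api_dataclasses):

# Continuing the example above: evaluate() returns an EvaluationResult
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

# Inspect or serialize the standardized result object as needed
print(result)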

nemo_evaluator.api.show_available_tasks() → None

Prints all available evaluations in the following format:

{harness1}:
    * benchmark A
    * benchmark B
{harness2}:
    * benchmark A
    * benchmark B
...

Important

Only evaluations from installed wheels are displayed.
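
For example, to print the installed harnesses and their benchmarks (a minimal sketch; the output depends on which evaluation wheels are installed):

from nemo_evaluator.api import show_available_tasks

# Prints every installed harness and its benchmarks to stdout
show_available_tasks()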

nemo_evaluator.api.get_available_evaluations() → tuple[dict[str, dict[str, Evaluation]], dict[str, Evaluation], dict]

Returns all available evaluations as Evaluation objects.

Important

Only evaluations from installed wheels are returned.

Returns:

Tuple with the following elements:

  1. Mapping: harness name -> tasks (dict)

  2. Mapping: harness name -> default configs (for non-exposed tasks); the returned Evaluation should serve as a blueprint

  3. Mapping: task name -> list of Evaluations

Return type:

tuple[dict[str, dict[str, Evaluation]], dict[str, Evaluation], dict]
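
A minimal sketch of unpacking the returned tuple (the variable names below are illustrative, not part of the API):

from nemo_evaluator.api import get_available_evaluations

# Unpack the three mappings described above (names are illustrative)
tasks_by_harness, defaults_by_harness, evaluations_by_task = get_available_evaluations()

# For example, list each installed harness and its exposed tasks
for harness, tasks in tasks_by_harness.items():
    print(harness, "->", sorted(tasks.keys()))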

nemo_evaluator.api.check_endpoint(
    endpoint_url: str,
    endpoint_type: Literal['completions', 'chat'],
    model_name: str,
    max_retries: int = 600,
    retry_interval: int = 2,
) → bool

Checks if the OpenAI-compatible endpoint is alive by sending a simple prompt.

Parameters:
  • endpoint_url (str) – Full endpoint URL. For most servers, this means either /v1/chat/completions or /completions must be provided

  • endpoint_type (Literal['completions', 'chat']) – Indicates whether the model is instruction-tuned ("chat") or a base model ("completions"). Used to construct the proper payload structure.

  • model_name (str) – Model name to include in the payload. Might be required by some endpoints.

  • max_retries (int, optional) – How many attempts to make before returning False. Defaults to 600.

  • retry_interval (int, optional) – How many seconds to wait between attempts. Defaults to 2.

Raises:

ValueError – if endpoint_type is not one of "completions" or "chat"

Returns:

Whether the endpoint is alive

Return type:

bool
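
A minimal sketch of probing a chat endpoint before launching an evaluation (the URL and model name below are only examples, and max_retries is lowered from its default of 600 for illustration):

from nemo_evaluator.api import check_endpoint

is_alive = check_endpoint(
    endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
    endpoint_type="chat",
    model_name="meta/llama-3.1-8b-instruct",
    max_retries=10,     # illustrative override; the default is 600
    retry_interval=2,
)
if not is_alive:
    raise RuntimeError("Endpoint did not respond in time")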