NeMo Evaluator Core#

Best for: Developers who need programmatic control

The NeMo Evaluator Core provides direct Python API access for custom configurations and integration into existing Python workflows.

Prerequisites#

  • Python environment

  • OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated); a quick connectivity check is sketched below
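
If you want to confirm that the endpoint and key work before running anything, a rough connectivity check along the lines of the sketch below is enough. It assumes the requests package is installed, that NGC_API_KEY is already exported, and it uses the same NVIDIA API Catalog URL and model that appear in the examples on this page; adjust both for your own endpoint.

# Optional connectivity check; assumes NGC_API_KEY is already exported
import os

import requests

resp = requests.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NGC_API_KEY']}"},
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=60,
)
print(resp.status_code)  # 200 means the endpoint and key are usable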

Quick Start#

# 1. Install nemo-evaluator and the nvidia-simple-evals harness
pip install nemo-evaluator nvidia-simple-evals

# 2. List available benchmarks and tasks
nemo-evaluator ls

# 3. Run an evaluation
# Set your API key in the shell first:
export NGC_API_KEY="nvapi-..."

# Then launch it from Python:
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    # Remove limit_samples for full dataset
    params=ConfigParams(limit_samples=10),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key="NGC_API_KEY",  # name of the env var holding the key, not the key itself
        type=EndpointType.CHAT,
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")

Complete Working Example#

Using Python API#

# Prerequisites: Set your API key before running the snippet
# export NGC_API_KEY="nvapi-..."

from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=3,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1,
        max_retries=5,
    ),
)

# Configure target
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="NGC_API_KEY",
    )
)

# Run evaluation
try:
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(f"Evaluation completed. Results saved to: {eval_config.output_dir}")
except Exception as e:
    print(f"Evaluation failed: {e}")

Using CLI#

# --api_key_name takes the name of the environment variable that holds the key, not the key itself
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name NGC_API_KEY \
    --output_dir ./results \
    --overrides 'config.params.limit_samples=3,config.params.temperature=0.0,config.params.max_new_tokens=1024,config.params.parallelism=1,config.params.max_retries=5'

Key Features#

Programmatic Integration#

  • Direct Python API access

  • Pydantic-based configuration with type hints

  • Integration with existing Python workflows (see the sketch after this list)
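
Because the configuration objects are Pydantic models, they can be built from plain dictionaries, for example parsed from a YAML or JSON file in an existing pipeline. A minimal sketch, assuming nested dictionaries validate into the corresponding models as in standard Pydantic usage:

from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

# Plain dicts, e.g. loaded from your own YAML/JSON configuration
eval_dict = {
    "type": "mmlu_pro",
    "output_dir": "./results",
    "params": {"limit_samples": 10, "temperature": 0.0},
}
endpoint_dict = {
    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
    "model_id": "meta/llama-3.1-8b-instruct",
    "api_key": "NGC_API_KEY",  # env var name, not the key itself
    "type": EndpointType.CHAT,
}

eval_config = EvaluationConfig(**eval_dict)
target_config = EvaluationTarget(api_endpoint=ApiEndpoint(**endpoint_dict))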

Evaluation Configuration#

  • Fine-grained parameter control via ConfigParams

  • Multiple evaluation types: mmlu_pro, gsm8k, hellaswag, and more

  • Configurable sampling, temperature, and token limits

Endpoint Support#

  • Chat endpoints (EndpointType.CHAT)

  • Completion endpoints (EndpointType.COMPLETIONS); see the sketch after this list

  • VLM endpoints (EndpointType.VLM)

  • Embedding endpoints (EndpointType.EMBEDDING)
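
For example, a completions-style target reuses the same ApiEndpoint fields with a different url and type; the self-hosted URL and model name below are placeholders (the same ones used in the adapter example further down):

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

# Self-hosted, OpenAI-compatible completions endpoint (no gated API key assumed)
completions_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        model_id="my_model",
        type=EndpointType.COMPLETIONS,
    )
)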

Advanced Usage Patterns#

Multi-Benchmark Evaluation#

# Prerequisites: Set your API key
# export NGC_API_KEY="nvapi-..."

from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate

# Configure target once
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key="NGC_API_KEY",
        type=EndpointType.CHAT,
    )
)

# Run multiple benchmarks
benchmarks = ["mmlu_pro", "humaneval", "mgsm"]
results = {}

for benchmark in benchmarks:
    config = EvaluationConfig(
        type=benchmark,
        output_dir=f"./results/{benchmark}",
        params=ConfigParams(limit_samples=10),
    )

    result = evaluate(eval_cfg=config, target_cfg=target_config)
    results[benchmark] = result

Discovering Installed Benchmarks#

from nemo_evaluator import show_available_tasks

# List all installed evaluation tasks
show_available_tasks()

Tip

To extend the list of benchmarks, install additional harnesses. See the list of evaluation harnesses available as PyPI wheels: Available PyPI Packages.
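
For example, after installing another harness wheel from that listing (the package name below is illustrative; substitute one from the linked page), re-run the discovery call to confirm that its tasks appear:

# Illustrative only: install a harness wheel from the PyPI listing, e.g.
#   pip install nvidia-lm-eval
from nemo_evaluator import show_available_tasks

show_available_tasks()  # tasks from the newly installed harness should now be listed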

Using Adapters and Interceptors#

For advanced evaluation scenarios, configure the adapter system with interceptors for request/response processing, caching, logging, and more:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure evaluation target with adapter
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="my_model"
)

# Create adapter configuration with interceptors
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={"system_message": "You are a helpful AI assistant. Think step by step."}
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 50}
        ),
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        ),
        InterceptorConfig(
            name="endpoint",
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 50}
        ),
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        ),
        InterceptorConfig(
            name="progress_tracking",
            config={"progress_tracking_url": "http://localhost:3828/progress"}
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)

# Run evaluation with full adapter pipeline
config = EvaluationConfig(
    type="gsm8k",
    output_dir="./results/gsm8k",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=512,
        parallelism=1
    )
)

result = evaluate(eval_cfg=config, target_cfg=target)
print(f"Evaluation completed: {result}")

Available Interceptors:

  • system_message: Add custom system prompts to chat requests

  • request_logging: Log incoming requests for debugging

  • response_logging: Log outgoing responses for debugging

  • caching: Cache responses to reduce API costs and speed up reruns (a minimal caching-only setup is sketched after this list)

  • reasoning: Extract chain-of-thought reasoning from model responses

  • progress_tracking: Track evaluation progress and send updates
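
You do not need the full pipeline shown above. A minimal sketch that only enables response caching, reusing the api_endpoint object defined earlier and keeping the endpoint interceptor as in that example:

from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Cache responses so reruns skip repeated API calls
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={"cache_dir": "./evaluation_cache", "reuse_cached_responses": True},
        ),
        InterceptorConfig(name="endpoint"),
    ]
)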

For complete adapter documentation, refer to Usage.

Next Steps#

  • Integrate into your existing Python workflows

  • Run multiple benchmarks in sequence

  • Explore available evaluation types with show_available_tasks()

  • Configure adapters and interceptors for advanced evaluation scenarios

  • Consider NeMo Evaluator Launcher for CLI workflows

  • Try Container Direct for containerized environments