NeMo Evaluator Core#

Best for: Developers who need programmatic control

The NeMo Evaluator Core provides direct Python API access for custom configurations and integration into existing Python workflows.

Prerequisites#

  • Python environment with nemo-evaluator installed

  • OpenAI-compatible endpoint (a quick reachability check is sketched below)
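
If you want to confirm the endpoint is reachable before launching an evaluation, a minimal sketch using only the standard library is shown below. It assumes the endpoint exposes the usual OpenAI-style /v1/models route; the base URL and environment variable name are placeholders for your own values.

import json
import os
import urllib.request

# Placeholder values: point these at your own OpenAI-compatible endpoint and key
base_url = "https://integrate.api.nvidia.com/v1"
api_key = os.environ.get("NGC_API_KEY", "")

request = urllib.request.Request(
    f"{base_url}/models",
    headers={"Authorization": f"Bearer {api_key}"},
)

# A successful response confirms the endpoint speaks the OpenAI-compatible API
with urllib.request.urlopen(request) as response:
    models = json.load(response)
    print([m["id"] for m in models.get("data", [])])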

Quick Start#

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, 
    EvaluationTarget, 
    ApiEndpoint, 
    EndpointType,
    ConfigParams
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1
    )
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key="your_api_key_here",
        type=EndpointType.CHAT
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")

Complete Working Example#

import os

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Set up environment
os.environ["NGC_API_KEY"] = "nvapi-your-key-here"

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=3,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1,
        max_retries=5
    )
)

# Configure target
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key=os.environ["NGC_API_KEY"]
    )
)

# Run evaluation
try:
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(f"Evaluation completed. Results saved to: {eval_config.output_dir}")
except Exception as e:
    print(f"Evaluation failed: {e}")

Key Features#

Programmatic Integration#

  • Direct Python API access

  • Pydantic-based configuration with type hints (serialization sketched after this list)

  • Integration with existing Python workflows
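
Because the configuration objects are Pydantic models, they can be serialized and rebuilt like any other Pydantic model, which helps when evaluation settings live alongside other pipeline configuration. A minimal sketch, assuming Pydantic v2 (v1 installations would use .dict() and parse_obj() instead):

from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig

eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(limit_samples=10, temperature=0.0)
)

# Serialize to a plain dict, e.g. to store with other pipeline settings
config_dict = eval_config.model_dump()

# Rebuild a validated config object from the stored dict later
restored = EvaluationConfig.model_validate(config_dict)
print(restored.params.limit_samples)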

Evaluation Configuration#

  • Fine-grained parameter control via ConfigParams (see the sketch after this list)

  • Multiple evaluation types: mmlu_pro, gsm8k, hellaswag, and more

  • Configurable sampling, temperature, and token limits
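
These parameters all live on ConfigParams, so one params object can be defined once and reused across several benchmark configs. The sketch below uses only fields that appear elsewhere on this page:

from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig

# Generation and execution controls, defined once
shared_params = ConfigParams(
    limit_samples=10,      # cap the number of evaluated samples (useful for smoke tests)
    temperature=0.0,       # deterministic decoding
    max_new_tokens=1024,   # generation length limit
    parallelism=1,         # request parallelism against the endpoint
    max_retries=5          # retry transient endpoint failures
)

# Reuse the same controls for different benchmarks
gsm8k_config = EvaluationConfig(type="gsm8k", output_dir="./results/gsm8k", params=shared_params)
mmlu_config = EvaluationConfig(type="mmlu_pro", output_dir="./results/mmlu_pro", params=shared_params)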

Endpoint Support#

  • Chat endpoints (EndpointType.CHAT)

  • Completion endpoints (EndpointType.COMPLETIONS); a completions target is sketched after this list

  • VLM endpoints (EndpointType.VLM)

  • Embedding endpoints (EndpointType.EMBEDDING)
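
The endpoint type is selected on ApiEndpoint. For example, targeting a completions endpoint instead of a chat endpoint only changes the URL and the type field; the URL and model name below are placeholders:

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

completions_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",  # placeholder completions URL
        model_id="my_model",                        # placeholder model name
        type=EndpointType.COMPLETIONS
    )
)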

Advanced Usage Patterns#

Multi-Benchmark Evaluation#

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType, ConfigParams
)

# Configure target once
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key="your_api_key_here",
        type=EndpointType.CHAT
    )
)

# Run multiple benchmarks
benchmarks = ["gsm8k", "hellaswag", "arc_easy"]
results = {}

for benchmark in benchmarks:
    config = EvaluationConfig(
        type=benchmark,
        output_dir=f"./results/{benchmark}",
        params=ConfigParams(limit_samples=10)
    )
    
    result = evaluate(eval_cfg=config, target_cfg=target_config)
    results[benchmark] = result

Discovering Available Benchmarks#

from nemo_evaluator import show_available_tasks

# List all installed evaluation tasks
show_available_tasks()

Using Adapters and Interceptors#

For advanced evaluation scenarios, configure the adapter system with interceptors for request/response processing, caching, logging, and more:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure evaluation target with adapter
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="my_model"
)

# Create adapter configuration with interceptors
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={"system_message": "You are a helpful AI assistant. Think step by step."}
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 50}
        ),
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 50}
        ),
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        ),
        InterceptorConfig(
            name="progress_tracking",
            config={"progress_tracking_url": "http://localhost:3828/progress"}
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)

# Run evaluation with full adapter pipeline
config = EvaluationConfig(
    type="gsm8k",
    output_dir="./results/gsm8k",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=512,
        parallelism=1
    )
)

result = evaluate(eval_cfg=config, target_cfg=target)
print(f"Evaluation completed: {result}")

Available Interceptors:

  • system_message: Add custom system prompts to chat requests

  • request_logging: Log incoming requests for debugging

  • response_logging: Log outgoing responses for debugging

  • caching: Cache responses to reduce API costs and speed up reruns (a caching-only setup is sketched after this list)

  • reasoning: Extract chain-of-thought reasoning from model responses

  • progress_tracking: Track evaluation progress and send updates
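
You do not need to enable the full pipeline; assuming interceptors can be configured individually, a minimal sketch that keeps only the caching interceptor from the example above looks like this (the cache directory is a placeholder):

from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Cache responses on disk and reuse them on subsequent runs
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        )
    ]
)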

For complete adapter documentation, refer to Usage.

Next Steps#

  • Integrate into your existing Python workflows

  • Run multiple benchmarks in sequence

  • Explore available evaluation types with show_available_tasks()

  • Configure adapters and interceptors for advanced evaluation scenarios

  • Consider NeMo Evaluator Launcher for CLI workflows

  • Try Container Direct for containerized environments