NeMo Evaluator Core#
Best for: Developers who need programmatic control
The NeMo Evaluator Core provides direct Python API access for custom configurations and integration into existing Python workflows.
Prerequisites#
Python environment
OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated)
Quick Start#
# 1. Install the nemo-evaluator and nvidia-simple-evals packages
pip install nemo-evaluator nvidia-simple-evals
# 2. List available benchmarks and tasks
nemo-evaluator ls
# 3. Run evaluation
# Prerequisites: Set your API key
export NGC_API_KEY="nvapi-..."
# Launch using python:
from nemo_evaluator.api.api_dataclasses import (
ApiEndpoint,
ConfigParams,
EndpointType,
EvaluationConfig,
EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate
# Configure evaluation
eval_config = EvaluationConfig(
type="mmlu_pro",
output_dir="./results",
# Remove limit_samples for full dataset
params=ConfigParams(limit_samples=10),
)
# Configure target endpoint
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
url="https://integrate.api.nvidia.com/v1/chat/completions",
model_id="meta/llama-3.1-8b-instruct",
api_key="NGC_API_KEY",
type=EndpointType.CHAT,
)
)
# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
Complete Working Example#
Using Python API#
# Prerequisites: Set your API key before running the snippet
# export NGC_API_KEY="nvapi-..."
from nemo_evaluator.api.api_dataclasses import (
ApiEndpoint,
ConfigParams,
EndpointType,
EvaluationConfig,
EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate
# Configure evaluation
eval_config = EvaluationConfig(
type="mmlu_pro",
output_dir="./results",
params=ConfigParams(
limit_samples=3,
temperature=0.0,
max_new_tokens=1024,
parallelism=1,
max_retries=5,
),
)
# Configure target
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
model_id="meta/llama-3.1-8b-instruct",
url="https://integrate.api.nvidia.com/v1/chat/completions",
type=EndpointType.CHAT,
api_key="NGC_API_KEY",
)
)
# Run evaluation
try:
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(f"Evaluation completed. Results saved to: {eval_config.output_dir}")
except Exception as e:
    print(f"Evaluation failed: {e}")
Using CLI#
# Note: --api_key_name takes the name of the environment variable, not the key itself
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name NGC_API_KEY \
    --output_dir ./results \
    --overrides 'config.params.limit_samples=3,config.params.temperature=0.0,config.params.max_new_tokens=1024,config.params.parallelism=1,config.params.max_retries=5'
Key Features#
Programmatic Integration#
Direct Python API access
Pydantic-based configuration with type hints
Integration with existing Python workflows
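Because the configuration objects are Pydantic models, they validate field types at construction time and can be serialized for logging or record-keeping. A minimal sketch, assuming a Pydantic v2 installation (on Pydantic v1 environments, use .dict() instead of .model_dump()):
from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig
# Build a config and dump it to a plain dict, e.g. to log it or store it
# alongside the evaluation results.
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(limit_samples=10, temperature=0.0),
)
print(eval_config.model_dump())  # Pydantic v2; use eval_config.dict() on v1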
Evaluation Configuration#
Fine-grained parameter control via ConfigParams
Multiple evaluation types: mmlu_pro, gsm8k, hellaswag, and more
Configurable sampling, temperature, and token limits
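For example, omitting limit_samples runs the full dataset while keeping the generation controls explicit. A minimal sketch; the parameter values are illustrative:
from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig
# Full-dataset run: no limit_samples, explicit sampling and token controls.
full_run_config = EvaluationConfig(
    type="gsm8k",
    output_dir="./results/gsm8k_full",
    params=ConfigParams(
        temperature=0.0,      # greedy decoding for reproducible scores
        max_new_tokens=1024,  # per-sample generation budget
        parallelism=4,        # concurrent requests; tune to your endpoint
    ),
)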
Endpoint Support#
Chat endpoints (EndpointType.CHAT)
Completion endpoints (EndpointType.COMPLETIONS)
VLM endpoints (EndpointType.VLM)
Embedding endpoints (EndpointType.EMBEDDING)
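Switching endpoint types only changes the type field (and usually the url) on ApiEndpoint. A minimal sketch targeting a completions-style endpoint; the local URL is illustrative, and VLM or embedding targets follow the same pattern with EndpointType.VLM or EndpointType.EMBEDDING and a matching URL:
from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget
# Completions endpoint instead of chat; swap the type and URL to match
# whatever your model serves.
completions_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions",  # illustrative self-hosted endpoint
        model_id="my_model",
        type=EndpointType.COMPLETIONS,  # add api_key=... here if the endpoint is gated
    )
)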
Advanced Usage Patterns#
Multi-Benchmark Evaluation#
# Prerequisites: Set your API key
# export NGC_API_KEY="nvapi-..."
from nemo_evaluator.api.api_dataclasses import (
ApiEndpoint,
ConfigParams,
EndpointType,
EvaluationConfig,
EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate
# Configure target once
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
url="https://integrate.api.nvidia.com/v1/chat/completions",
model_id="meta/llama-3.1-8b-instruct",
api_key="NGC_API_KEY",
type=EndpointType.CHAT,
)
)
# Run multiple benchmarks
benchmarks = ["mmlu_pro", "humaneval", "mgsm"]
results = {}
for benchmark in benchmarks:
    config = EvaluationConfig(
        type=benchmark,
        output_dir=f"./results/{benchmark}",
        params=ConfigParams(limit_samples=10),
    )
    result = evaluate(eval_cfg=config, target_cfg=target_config)
    results[benchmark] = result
Discovering Installed Benchmarks#
from nemo_evaluator import show_available_tasks
# List all installed evaluation tasks
show_available_tasks()
Tip
To extend the list of benchmarks, install additional harnesses. See the list of evaluation harnesses available as PyPI wheels: Available PyPI Packages.
Using Adapters and Interceptors#
For advanced evaluation scenarios, configure the adapter system with interceptors for request/response processing, caching, logging, and more:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
# Configure evaluation target with adapter
api_endpoint = ApiEndpoint(
url="http://0.0.0.0:8080/v1/completions/",
type=EndpointType.COMPLETIONS,
model_id="my_model"
)
# Create adapter configuration with interceptors
api_endpoint.adapter_config = AdapterConfig(
interceptors=[
InterceptorConfig(
name="system_message",
config={"system_message": "You are a helpful AI assistant. Think step by step."}
),
InterceptorConfig(
name="request_logging",
config={"max_requests": 50}
),
InterceptorConfig(
name="caching",
config={
"cache_dir": "./evaluation_cache",
"reuse_cached_responses": True
}
),
InterceptorConfig(
name="endpoint",
),
InterceptorConfig(
name="response_logging",
config={"max_responses": 50}
),
InterceptorConfig(
name="reasoning",
config={
"start_reasoning_token": "<think>",
"end_reasoning_token": "</think>"
}
),
InterceptorConfig(
name="progress_tracking",
config={"progress_tracking_url": "http://localhost:3828/progress"}
)
]
)
target = EvaluationTarget(api_endpoint=api_endpoint)
# Run evaluation with full adapter pipeline
config = EvaluationConfig(
type="gsm8k",
output_dir="./results/gsm8k",
params=ConfigParams(
limit_samples=10,
temperature=0.0,
max_new_tokens=512,
parallelism=1
)
)
result = evaluate(eval_cfg=config, target_cfg=target)
print(f"Evaluation completed: {result}")
Available Interceptors:
system_message: Add custom system prompts to chat requests
request_logging: Log incoming requests for debugging
response_logging: Log outgoing responses for debugging
caching: Cache responses to reduce API costs and speed up reruns
reasoning: Extract chain-of-thought reasoning from model responses
progress_tracking: Track evaluation progress and send updates
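A smaller pipeline is often enough. A minimal sketch that keeps only the caching and endpoint interceptors from the example above, so repeated runs reuse responses from disk:
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget
# Cache-only pipeline: responses are written to ./evaluation_cache and
# reused on reruns instead of calling the endpoint again.
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="my_model",
)
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={"cache_dir": "./evaluation_cache", "reuse_cached_responses": True},
        ),
        InterceptorConfig(name="endpoint"),  # forwards requests to the model endpoint
    ]
)
target = EvaluationTarget(api_endpoint=api_endpoint)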
For complete adapter documentation, refer to Usage.
Next Steps#
Integrate into your existing Python workflows
Run multiple benchmarks in sequence
Explore available evaluation types with show_available_tasks()
Configure adapters and interceptors for advanced evaluation scenarios
Consider NeMo Evaluator Launcher for CLI workflows
Try Container Direct for containerized environments