NeMo Evaluator Core#
Best for: Developers who need programmatic control
The NeMo Evaluator Core provides direct Python API access for custom configurations and integration into existing Python workflows.
Prerequisites#
Python environment with nemo-evaluator installed
OpenAI-compatible endpoint
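If the package is not yet installed, it is typically available from PyPI, for example via pip install nemo-evaluator.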
Quick Start#
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1
    )
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key="your_api_key_here",
        type=EndpointType.CHAT
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
Complete Working Example#
import os

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Set up environment
os.environ["NGC_API_KEY"] = "nvapi-your-key-here"

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=3,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1,
        max_retries=5
    )
)

# Configure target
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key=os.environ["NGC_API_KEY"]
    )
)

# Run evaluation
try:
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(f"Evaluation completed. Results saved to: {eval_config.output_dir}")
except Exception as e:
    print(f"Evaluation failed: {e}")
Key Features#
Programmatic Integration#
Direct Python API access
Pydantic-based configuration with type hints (see the sketch below)
Integration with existing Python workflows
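Because the configuration objects are Pydantic models, invalid field values are rejected when the object is constructed, and a config can be printed or logged for record keeping. A minimal sketch, reusing only task names and parameters that already appear on this page:

from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig

# Build the config up front; a bad field value raises a validation error
# here instead of partway through an evaluation run.
config = EvaluationConfig(
    type="gsm8k",
    output_dir="./results/gsm8k",
    params=ConfigParams(limit_samples=5, temperature=0.0, max_new_tokens=512),
)

# Printing the model gives a readable record of every parameter used.
print(config)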
Evaluation Configuration#
Fine-grained parameter control via ConfigParams
Multiple evaluation types: mmlu_pro, gsm8k, hellaswag, and more
Configurable sampling, temperature, and token limits
Endpoint Support#
Chat endpoints (EndpointType.CHAT)
Completion endpoints (EndpointType.COMPLETIONS); see the sketch after this list
VLM endpoints (EndpointType.VLM)
Embedding endpoints (EndpointType.EMBEDDING)
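The chat examples above target a hosted chat endpoint; other endpoint types follow the same pattern, with only the URL and the type field changing. A minimal sketch for a completions endpoint, borrowing the local server URL and model ID used in the adapter example later on this page (both are placeholders for your own deployment):

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

# Same target structure as the chat examples; only the endpoint URL and
# the endpoint type differ.
completions_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        model_id="my_model",
        type=EndpointType.COMPLETIONS,
    )
)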
Advanced Usage Patterns#
Multi-Benchmark Evaluation#
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType, ConfigParams
)

# Configure target once
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key="your_api_key_here",
        type=EndpointType.CHAT
    )
)

# Run multiple benchmarks
benchmarks = ["gsm8k", "hellaswag", "arc_easy"]
results = {}

for benchmark in benchmarks:
    config = EvaluationConfig(
        type=benchmark,
        output_dir=f"./results/{benchmark}",
        params=ConfigParams(limit_samples=10)
    )
    result = evaluate(eval_cfg=config, target_cfg=target_config)
    results[benchmark] = result
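Each run above writes its artifacts under its own output_dir, and the returned result objects are collected in results. A small follow-up, continuing the loop above with plain Python (no additional nemo-evaluator APIs assumed), records where each benchmark's output lives:

import json

# Map each benchmark to the directory its artifacts were written to,
# so downstream scripts know where to look.
index = {benchmark: f"./results/{benchmark}" for benchmark in benchmarks}
with open("./results/index.json", "w") as f:
    json.dump(index, f, indent=2)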
Discovering Available Benchmarks#
from nemo_evaluator import show_available_tasks
# List all installed evaluation tasks
show_available_tasks()
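The printed list reflects the evaluation task packages installed in the current environment, so it can differ from one installation to another.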
Using Adapters and Interceptors#
For advanced evaluation scenarios, configure the adapter system with interceptors for request/response processing, caching, logging, and more:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure evaluation target with adapter
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="my_model"
)

# Create adapter configuration with interceptors
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={"system_message": "You are a helpful AI assistant. Think step by step."}
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 50}
        ),
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 50}
        ),
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        ),
        InterceptorConfig(
            name="progress_tracking",
            config={"progress_tracking_url": "http://localhost:3828/progress"}
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)

# Run evaluation with full adapter pipeline
config = EvaluationConfig(
    type="gsm8k",
    output_dir="./results/gsm8k",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=512,
        parallelism=1
    )
)

result = evaluate(eval_cfg=config, target_cfg=target)
print(f"Evaluation completed: {result}")
Available Interceptors:
system_message: Add custom system prompts to chat requests
request_logging: Log incoming requests for debugging
response_logging: Log outgoing responses for debugging
caching: Cache responses to reduce API costs and speed up reruns (see the sketch below)
reasoning: Extract chain-of-thought reasoning from model responses
progress_tracking: Track evaluation progress and send updates
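Interceptors can also be enabled individually rather than as the full pipeline shown above. A minimal sketch that attaches only the caching interceptor, reusing the interceptor name and config keys from the example above:

from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Cache responses on disk and reuse them on repeated runs.
cache_only_adapter = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        )
    ]
)

# Attach it to an endpoint before building the target, as in the example above:
# api_endpoint.adapter_config = cache_only_adapter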
For complete adapter documentation, refer to Usage.
Next Steps#
Integrate into your existing Python workflows
Run multiple benchmarks in sequence
Explore available evaluation types with show_available_tasks()
Configure adapters and interceptors for advanced evaluation scenarios
Consider NeMo Evaluator Launcher for CLI workflows
Try Container Direct for containerized environments