API Reference#
This document provides a comprehensive reference for the nemo-evaluator Python API.
Prerequisites#
Container: Use one of the evaluation containers listed in NeMo Evaluator Containers.
Package: Install the nemo-evaluator package:
pip install nemo-evaluator
To run evaluations, you also need to install an evaluation framework package (for example, nvidia-simple-evals):
pip install nvidia-simple-evals
CLI vs. Programmatic Usage#
The NeMo Evaluator API supports two usage patterns:
CLI Usage (Recommended): Use the run_eval function, which parses command line arguments.
Programmatic Usage: Use the evaluate() function with configuration objects.
When to Use Which:
CLI: For command-line tools, scripts, and simple automation
Programmatic: For building custom applications, workflows, and integration with other systems
Available Data Classes#
The API provides several dataclasses for configuration:
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig, # Main evaluation configuration
EvaluationTarget, # Target model configuration
ConfigParams, # Evaluation parameters
ApiEndpoint, # API endpoint configuration
EvaluationResult, # Evaluation results
TaskResult, # Individual task results
MetricResult, # Metric scores
Score, # Score representation
ScoreStats, # Score statistics
GroupResult, # Grouped results
EndpointType, # Endpoint type enum
Evaluation # Complete evaluation object
)
run_eval#
The main entry point for running evaluations. This is a CLI entry point that parses command line arguments.
from nemo_evaluator.api.run import run_eval
def run_eval() -> None:
"""
CLI entry point for running evaluations.
This function parses command line arguments and executes evaluations.
It does not take parameters directly - all configuration is passed through CLI arguments.
CLI Arguments:
--eval_type: Type of evaluation to run (such as "mmlu_pro", "gpqa_diamond")
--model_id: Model identifier (such as "meta/llama-3.1-8b-instruct")
--model_url: API endpoint URL (such as "https://integrate.api.NVIDIA.com/v1/chat/completions" for chat endpoint type)
--model_type: Endpoint type ("chat", "completions", "vlm", "embedding")
--api_key_name: Name of the environment variable that stores the API key (optional)
--output_dir: Output directory for results
--run_config: Path to YAML Run Configuration file (optional)
--overrides: Comma-separated dot-style parameter overrides (optional)
--dry_run: Show rendered config without running (optional)
--debug: Enable debug logging (optional, deprecated, use NV_LOG_LEVEL=DEBUG env var)
Usage:
run_eval() # Parses sys.argv automatically
"""
Note
The run_eval() function is designed as a CLI entry point. For programmatic usage, use the underlying configuration objects and the evaluate() function directly.
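If you nevertheless need to drive the CLI entry point from a script (for example, in a test harness), one option is to populate sys.argv before calling it, since the function parses sys.argv automatically. This is a minimal sketch, not the recommended programmatic path; the endpoint URL, model, and API key variable below are placeholders.
import sys
from nemo_evaluator.api.run import run_eval

# Build the argument vector exactly as it would appear on the command line.
# Substitute your own endpoint, model, and output directory.
sys.argv = [
    "run_eval",
    "--eval_type", "mmlu_pro",
    "--model_id", "meta/llama-3.1-8b-instruct",
    "--model_url", "https://integrate.api.NVIDIA.com/v1/chat/completions",
    "--model_type", "chat",
    "--api_key_name", "MY_API_KEY",
    "--output_dir", "./results",
]

run_eval()  # Parses sys.argv automatically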
evaluate#
The core evaluation function for programmatic usage.
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, EvaluationTarget
def evaluate(
eval_cfg: EvaluationConfig,
target_cfg: EvaluationTarget
) -> EvaluationResult:
"""
Run an evaluation using configuration objects.
Args:
eval_cfg: Evaluation configuration object
target_cfg: Target configuration object
Returns:
EvaluationResult: Evaluation results and metadata
"""
Example Programmatic Usage:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig,
EvaluationTarget,
ConfigParams,
ApiEndpoint
)
# Create evaluation configuration
eval_config = EvaluationConfig(
type="simple_evals.mmlu_pro",
output_dir="./results",
params=ConfigParams(
limit_samples=100,
temperature=0.1
)
)
# Create target configuration
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
url="https://integrate.api.NVIDIA.com/v1/chat/completions",
model_id="meta/llama-3.1-8b-instruct",
type="chat",
api_key="MY_API_KEY" # Name of the environment variable that stores the API key
)
)
# Run evaluation
result = evaluate(eval_config, target_config)
Data Structures#
EvaluationConfig#
Configuration for evaluation runs, defined in api_dataclasses.py.
from nemo_evaluator.api.api_dataclasses import EvaluationConfig
class EvaluationConfig:
"""Configuration for evaluation runs."""
type: str # Type of evaluation - benchmark to be run
output_dir: str # Output directory
params: ConfigParams # Evaluation parameters (such as limit_samples and temperature)
EvaluationTarget#
Target configuration for API endpoints, defined in api_dataclasses.py.
from nemo_evaluator.api.api_dataclasses import EvaluationTarget, EndpointType
class EvaluationTarget:
"""Target configuration for API endpoints."""
api_endpoint: ApiEndpoint # API endpoint to be used for evaluation
class ApiEndpoint:
url: str # API endpoint URL
model_id: str # Model name or identifier
type: str # Endpoint type (chat, completions, vlm, or embedding)
api_key: str # Name of the env variable that stores API key
adapter_config: AdapterConfig # Adapter configuration
In the ApiEndpoint dataclass, type should be one of: EndpointType.CHAT, EndpointType.COMPLETIONS, EndpointType.VLM, EndpointType.EMBEDDING:
- CHAT endpoint accepts structured input as a sequence of messages (such as system, user, assistant roles). It returns a model-generated message, enabling controlled multi-turn interactions.
- COMPLETIONS endpoint takes a single prompt string and returns a text continuation, typically used for one-shot or single-turn tasks without conversational structure.
- VLM endpoint hosts a model that has vision capabilities.
- EMBEDDING endpoint hosts an embedding model.
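As an alternative to the plain string used in the earlier example, the endpoint type can be set with the EndpointType enum. A brief sketch, assuming the enum members correspond to the string values listed above; the URL, model, and API key variable are placeholders.
from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.NVIDIA.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,   # Equivalent to type="chat"
        api_key="MY_API_KEY",     # Name of the env variable that stores the API key
    )
)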
Adapter System#
AdapterConfig#
Configuration for the adapter system, defined in adapter_config.py.
from nemo_evaluator.adapters.adapter_config import AdapterConfig
class AdapterConfig:
"""Configuration for the adapter system."""
discovery: DiscoveryConfig # Module discovery configuration
interceptors: list[InterceptorConfig] # List of interceptors
post_eval_hooks: list[PostEvalHookConfig] # Post-evaluation hooks
endpoint_type: str # Default endpoint type
caching_dir: str | None # Legacy caching directory
InterceptorConfig#
Configuration for individual interceptors.
from nemo_evaluator.adapters.adapter_config import InterceptorConfig
class InterceptorConfig:
"""Configuration for a single interceptor."""
name: str # Interceptor name
enabled: bool # Whether enabled
config: dict[str, Any] # Interceptor-specific configuration
DiscoveryConfig#
Configuration for discovering third-party modules and directories.
from nemo_evaluator.adapters.adapter_config import DiscoveryConfig
class DiscoveryConfig:
"""Configuration for discovering 3rd party modules and directories."""
modules: list[str] # List of module paths to discover
dirs: list[str] # List of directory paths to discover
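To tie these classes together, the sketch below builds an AdapterConfig with a single request-logging interceptor (its configuration keys are described in the next section) and attaches it to an ApiEndpoint through the adapter_config field. Treat this as illustrative wiring under the assumption that the remaining AdapterConfig fields have sensible defaults; the endpoint values are placeholders.
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EvaluationTarget

# Adapter configuration with one interceptor; see "Available Interceptors"
# below for the configuration keys each interceptor accepts.
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="request_logging",
            enabled=True,
            config={"output_dir": "/tmp/logs", "max_requests": 1000},
        ),
    ],
)

# Attach the adapter configuration to the endpoint used for evaluation.
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.NVIDIA.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type="chat",
        api_key="MY_API_KEY",
        adapter_config=adapter_config,
    )
)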
Available Interceptors#
1. Request Logging Interceptor#
from nemo_evaluator.adapters.interceptors.logging_interceptor import LoggingInterceptor
# Configuration
interceptor_config = {
"name": "request_logging",
"enabled": True,
"config": {
"output_dir": "/tmp/logs",
"max_requests": 1000,
"log_failed_requests": True
}
}
Features:
Logs all API requests and responses
Configurable output directory
Request/response count limits
Failed request logging
2. Caching Interceptor#
from nemo_evaluator.adapters.interceptors.caching_interceptor import CachingInterceptor
# Configuration
interceptor_config = {
"name": "caching",
"enabled": True,
"config": {
"cache_dir": "/tmp/cache",
"reuse_cached_responses": True,
"save_requests": True,
"save_responses": True,
"max_saved_requests": 1000,
"max_saved_responses": 1000
}
}
Features:
Response caching for performance
Persistent storage - responses are saved to disk, allowing resumption after process termination
Configurable cache directory
Request/response persistence
Cache size limits
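Because responses are persisted to disk, a run that is interrupted can be restarted and will reuse cached responses instead of re-querying the endpoint. A minimal sketch, using the configuration keys shown above and assuming defaults for the other AdapterConfig fields:
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Cache responses on disk so that a re-run after interruption reuses them
# instead of re-sending requests to the endpoint.
caching_adapter = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            enabled=True,
            config={
                "cache_dir": "/tmp/cache",
                "reuse_cached_responses": True,
                "save_requests": True,
                "save_responses": True,
            },
        ),
    ],
)
Pass caching_adapter as the adapter_config of the ApiEndpoint, exactly as in the wiring example shown earlier.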
3. Reasoning Interceptor#
from nemo_evaluator.adapters.interceptors.reasoning_interceptor import ReasoningInterceptor
# Configuration
interceptor_config = {
"name": "reasoning",
"enabled": True,
"config": {
"start_reasoning_token": "<think>",
"end_reasoning_token": "</think>",
"add_reasoning": True,
"enable_reasoning_tracking": True
}
}
Features:
Reasoning chain support
Custom reasoning tokens
Reasoning tracking and analysis
Chain-of-thought prompting
4. System Message Interceptor#
from nemo_evaluator.adapters.interceptors.system_message_interceptor import SystemMessageInterceptor
# Configuration
interceptor_config = {
"name": "system_message",
"enabled": True,
"config": {
"custom_system_prompt": "You are a helpful AI assistant.",
"override_existing": True
}
}
Features:
Custom system prompt injection
Prompt override capabilities
Consistent system behavior
Use Cases:
Modify system prompts for different evaluation scenarios
Test different prompt variations without code changes
Override existing system messages for consistent behavior
A/B testing of different prompt strategies
5. Endpoint Interceptor#
from nemo_evaluator.adapters.interceptors.endpoint_interceptor import EndpointInterceptor
# Configuration
interceptor_config = {
"name": "endpoint",
"enabled": True,
"config": {
"endpoint_url": "https://api.example.com/v1/chat/completions",
"timeout": 30
}
}
Features:
Endpoint URL management
Request timeout configuration
Endpoint validation
6. Payload Modifier Interceptor#
from nemo_evaluator.adapters.interceptors.payload_modifier_interceptor import PayloadModifierInterceptor
# Configuration
interceptor_config = {
"name": "payload_modifier",
"enabled": True,
"config": {
"params_to_add": {
"extra_body": {
"chat_template_kwargs": {
"enable_thinking": False
}
}
},
"params_to_remove": ["field_in_msgs_to_remove"],
"params_to_rename": {"max_tokens": "max_completion_tokens"}
}
}
Explanation:
This interceptor is particularly useful when custom behavior is needed. In this example, the enable_thinking parameter is a custom key that controls the reasoning mode of the model. When set to False, it disables the model’s internal reasoning/thinking process, which can be useful for scenarios where you want more direct responses without the model’s step-by-step reasoning output.
The field_in_msgs_to_remove field would be removed recursively from all messages in the payload.
Features:
Custom parameter injection
Remove fields recursively at all levels of the payload
Rename top-level payload keys
7. Client Error Interceptor#
from nemo_evaluator.adapters.interceptors.raise_client_error_interceptor import RaiseClientErrorInterceptor
# Configuration
interceptor_config = {
"name": "raise_client_error",
"enabled": True,
"config": {
"raise_on_error": True,
"error_threshold": 400
}
}
Features:
Error handling and propagation
Configurable error thresholds
Client error management
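Interceptors compose: several can be listed in one AdapterConfig and applied to every request. The sketch below combines the system message, reasoning, and caching interceptors from this section in a single programmatic evaluation. The configuration values are placeholders taken from the examples above, and the remaining AdapterConfig fields are assumed to have defaults.
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Chain multiple interceptors: inject a system prompt, strip reasoning tokens,
# and cache responses for reuse across runs.
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            enabled=True,
            config={"custom_system_prompt": "You are a helpful AI assistant."},
        ),
        InterceptorConfig(
            name="reasoning",
            enabled=True,
            config={"start_reasoning_token": "<think>", "end_reasoning_token": "</think>"},
        ),
        InterceptorConfig(
            name="caching",
            enabled=True,
            config={"cache_dir": "/tmp/cache", "reuse_cached_responses": True},
        ),
    ],
)

eval_config = EvaluationConfig(
    type="simple_evals.mmlu_pro",
    output_dir="./results",
    params=ConfigParams(limit_samples=100, temperature=0.1),
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.NVIDIA.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type="chat",
        api_key="MY_API_KEY",
        adapter_config=adapter_config,
    )
)

result = evaluate(eval_config, target_config)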