API Reference#
This page is a complete reference for the NeMo Evaluator Python API.
Core API Functions#
Choose from multiple API layers based on your needs:
API Layers#
Core Evaluation API (nemo_evaluator.core.evaluate): Direct evaluation with full adapter support
High-level API (nemo_evaluator.api.run): Simplified interface for common workflows
CLI Interface (nemo_evaluator.cli): Command-line evaluation tools
When to Use Each Layer#
Core API: Maximum flexibility, custom interceptors, integration into ML pipelines
High-level API: Standard evaluations with adapter configuration
CLI: Quick evaluations, scripting, and automation
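A minimal sketch of the Python import paths for the first two layers (the CLI layer is invoked from the shell rather than imported; both imports are covered in detail later in this reference):
from nemo_evaluator.core.evaluate import evaluate  # Core API: evaluate(eval_cfg, target_cfg)
from nemo_evaluator.api.run import run_eval        # High-level API module; run_eval is its CLI-style entry point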
Available Dataclasses#
Configure your evaluations using these dataclasses:
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig, # Main evaluation configuration
EvaluationTarget, # Target model configuration
ConfigParams, # Evaluation parameters
ApiEndpoint, # API endpoint configuration
EvaluationResult, # Evaluation results
TaskResult, # Individual task results
MetricResult, # Metric scores
Score, # Score representation
ScoreStats, # Score statistics
GroupResult, # Grouped results
EndpointType, # Endpoint type enum
Evaluation # Complete evaluation object
)
Core Evaluation API#
run_eval#
CLI entry point for running evaluations. This function parses command line arguments.
from nemo_evaluator.api.run import run_eval
def run_eval() -> None:
"""
CLI entry point for running evaluations.
This function parses command line arguments and executes evaluations.
It does not take parameters directly - all configuration is passed via CLI arguments.
CLI Arguments:
--eval_type: Type of evaluation to run (e.g., "mmlu_pro", "gsm8k")
--model_id: Model identifier (e.g., "meta/llama-3.1-8b-instruct")
--model_url: API endpoint URL (e.g., "https://integrate.api.nvidia.com/v1/chat/completions" for the chat endpoint type)
--model_type: Endpoint type ("chat", "completions", "vlm", "embedding")
--api_key_name: Name of the environment variable that holds the API key (optional)
--output_dir: Output directory for results
--run_config: Path to YAML Run Configuration file (optional)
--overrides: Comma-separated dot-style parameter overrides (optional)
--dry_run: Show rendered config without running (optional)
--debug: Enable debug logging (optional, deprecated, use NV_LOG_LEVEL=DEBUG env var)
Usage:
run_eval() # Parses sys.argv automatically
"""
Note
The run_eval() function is designed as a CLI entry point. For programmatic usage, use the evaluate() function directly with configuration objects.
evaluate#
The core evaluation function for programmatic usage.
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, EvaluationTarget
def evaluate(
eval_cfg: EvaluationConfig,
target_cfg: EvaluationTarget
) -> EvaluationResult:
"""
Run an evaluation using configuration objects.
Args:
eval_cfg: Evaluation configuration object containing output directory,
parameters, and evaluation type
target_cfg: Target configuration object containing API endpoint details
and adapter configuration
Returns:
EvaluationResult: Evaluation results and metadata
"""
Prerequisites:
Container: Use the simple-evals container listed in NeMo Evaluator Containers
Python package: pip install nemo-evaluator nvidia-simple-evals
Example Programmatic Usage:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig,
EvaluationTarget,
ConfigParams,
ApiEndpoint,
EndpointType
)
# Create evaluation configuration
eval_config = EvaluationConfig(
type="simple_evals.mmlu_pro",
output_dir="./results",
params=ConfigParams(
limit_samples=100,
temperature=0.1
)
)
# Create target configuration
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
url="https://integrate.api.nvidia.com/v1/chat/completions",
model_id="meta/llama-3.1-8b-instruct",
type=EndpointType.CHAT,
api_key="your_api_key_here"
)
)
# Run evaluation
result = evaluate(eval_config, target_config)
Data Structures#
EvaluationConfig#
Configuration for evaluation runs, defined in api_dataclasses.py.
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, ConfigParams
class EvaluationConfig:
"""Configuration for evaluation runs."""
output_dir: Optional[str] # Directory to output results
params: Optional[ConfigParams] # Evaluation parameters
supported_endpoint_types: Optional[list[str]] # Supported endpoint types
type: Optional[str] # Type of evaluation task
ConfigParams#
Parameters for evaluation execution.
from nemo_evaluator.api.api_dataclasses import ConfigParams
class ConfigParams:
"""Parameters for evaluation execution."""
limit_samples: Optional[int | float] # Limit number of evaluation samples
max_new_tokens: Optional[int] # Maximum tokens to generate
max_retries: Optional[int] # Number of REST request retries
parallelism: Optional[int] # Parallelism level
task: Optional[str] # Name of the task
temperature: Optional[float] # Sampling temperature (0.0-1.0)
request_timeout: Optional[int] # REST response timeout
top_p: Optional[float] # Top-p sampling parameter (0.0-1.0)
extra: Optional[Dict[str, Any]] # Framework-specific parameters
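As a sketch of how these fields compose (the constructor is assumed to accept the fields above as keyword arguments, and the key under extra is a hypothetical framework-specific parameter):
from nemo_evaluator.api.api_dataclasses import ConfigParams

params = ConfigParams(
    limit_samples=50,        # evaluate only the first 50 samples
    max_new_tokens=512,      # cap generation length
    temperature=0.0,         # deterministic sampling
    parallelism=4,           # number of concurrent requests
    extra={"n_shots": 5},    # framework-specific parameter (hypothetical key)
)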
EvaluationTarget#
Target configuration for API endpoints, defined in api_dataclasses.py.
from nemo_evaluator.api.api_dataclasses import EvaluationTarget, ApiEndpoint
class EvaluationTarget:
"""Target configuration for API endpoints."""
api_endpoint: Optional[ApiEndpoint] # API endpoint configuration
class ApiEndpoint:
"""API endpoint configuration."""
api_key: Optional[str] # API key or env variable name
model_id: Optional[str] # Model identifier
stream: Optional[bool] # Whether to stream responses
type: Optional[EndpointType] # Endpoint type (chat, completions, vlm, embedding)
url: Optional[str] # API endpoint URL
adapter_config: Optional[AdapterConfig] # Adapter configuration
Adapter System#
AdapterConfig#
Configuration for the adapter system, defined in adapter_config.py.
from nemo_evaluator.adapters.adapter_config import AdapterConfig
class AdapterConfig:
"""Configuration for the adapter system."""
discovery: DiscoveryConfig # Module discovery configuration
interceptors: list[InterceptorConfig] # List of interceptors
post_eval_hooks: list[PostEvalHookConfig] # Post-evaluation hooks
endpoint_type: str # Type of endpoint (default: "chat")
caching_dir: str | None # Legacy field (deprecated, use caching interceptor)
generate_html_report: bool # Whether to generate HTML report (default: True)
log_failed_requests: bool # Whether to log failed requests (default: False)
tracking_requests_stats: bool # Enable request statistics tracking (default: True)
html_report_size: int | None # Number of request-response pairs in HTML report (default: 5)
InterceptorConfig#
Configuration for individual interceptors.
from nemo_evaluator.adapters.adapter_config import InterceptorConfig
class InterceptorConfig:
"""Configuration for a single interceptor."""
name: str # Interceptor name
enabled: bool # Whether enabled
config: dict[str, Any] # Interceptor-specific configuration
DiscoveryConfig#
Configuration for discovering third-party modules and directories.
from nemo_evaluator.adapters.adapter_config import DiscoveryConfig
class DiscoveryConfig:
"""Configuration for discovering 3rd party modules and directories."""
modules: list[str] # List of module paths to discover
dirs: list[str] # List of directory paths to discover
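These three classes are typically used together. A minimal sketch, assuming the dataclasses accept the fields above as keyword arguments and that the adapter configuration is attached to an ApiEndpoint through its adapter_config field:
from nemo_evaluator.adapters.adapter_config import (
    AdapterConfig,
    DiscoveryConfig,
    InterceptorConfig,
)
from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

adapter_config = AdapterConfig(
    discovery=DiscoveryConfig(modules=[], dirs=[]),
    interceptors=[
        InterceptorConfig(name="request_logging", enabled=True, config={"max_requests": 2}),
        InterceptorConfig(name="caching", enabled=True, config={"cache_dir": "./cache"}),
    ],
    post_eval_hooks=[],
    endpoint_type="chat",
)

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,
        api_key="MY_API_KEY",           # API key or environment variable name
        adapter_config=adapter_config,  # attach the interceptor chain to this endpoint
    )
)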
Available Interceptors#
1. Request Logging Interceptor#
from nemo_evaluator.adapters.interceptors.logging_interceptor import RequestLoggingInterceptor
# Configuration
interceptor_config = {
"name": "request_logging",
"enabled": True,
"config": {
"max_requests": 2,
"log_request_body": True,
"log_request_headers": True
}
}
Features:
Logs incoming API requests
Configurable request count limit
Optional request body logging
Optional request headers logging
2. Caching Interceptor#
from nemo_evaluator.adapters.interceptors.caching_interceptor import CachingInterceptor
# Configuration
interceptor_config = {
"name": "caching",
"enabled": True,
"config": {
"cache_dir": "/tmp/cache",
"reuse_cached_responses": False,
"save_requests": False,
"save_responses": True,
"max_saved_requests": None,
"max_saved_responses": None
}
}
Features:
Response caching for performance
Configurable cache directory
Optional request/response persistence
Optional cache size limits
3. Reasoning Interceptor#
from nemo_evaluator.adapters.interceptors.reasoning_interceptor import ResponseReasoningInterceptor
# Configuration
interceptor_config = {
"name": "reasoning",
"enabled": True,
"config": {
"start_reasoning_token": "<think>",
"end_reasoning_token": "</think>",
"add_reasoning": True,
"migrate_reasoning_content": False,
"enable_reasoning_tracking": True,
"include_if_not_finished": True,
"stats_file_saving_interval": None,
"enable_caching": True,
"cache_dir": "/tmp/reasoning_interceptor",
"logging_aggregated_stats_interval": 100
}
}
Features:
Processes reasoning content in responses
Detects and removes reasoning tokens
Tracks reasoning statistics
Optional extraction of reasoning to separate fields
Caching support for interrupted runs
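For illustration only, a rough sketch of the token-stripping idea (not the interceptor's actual implementation):
# Hypothetical response content containing reasoning tokens
raw = "<think>2 + 2 equals 4.</think>The answer is 4."
start_token, end_token = "<think>", "</think>"

if start_token in raw and end_token in raw:
    reasoning = raw[raw.index(start_token) + len(start_token):raw.index(end_token)]
    answer = raw[raw.index(end_token) + len(end_token):]
# reasoning -> "2 + 2 equals 4.", answer -> "The answer is 4."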
4. System Message Interceptor#
from nemo_evaluator.adapters.interceptors.system_message_interceptor import SystemMessageInterceptor
# Configuration
interceptor_config = {
"name": "system_message",
"enabled": True,
"config": {
"system_message": "You are a helpful AI assistant."
}
}
Features:
Adds system message to requests
For chat endpoints: adds as system role message
For completions endpoints: prepends to the prompt
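A hypothetical before/after view of the request payload (illustrative only; the exact join used for completions prompts is an assumption):
# Chat endpoint: the configured text is inserted as a system-role message
chat_before = {"messages": [{"role": "user", "content": "What is 2 + 2?"}]}
chat_after = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ]
}

# Completions endpoint: the configured text is prepended to the prompt
completions_before = {"prompt": "What is 2 + 2?"}
completions_after = {"prompt": "You are a helpful AI assistant.\nWhat is 2 + 2?"}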
5. Endpoint Interceptor#
from nemo_evaluator.adapters.interceptors.endpoint_interceptor import EndpointInterceptor
# Configuration
interceptor_config = {
"name": "endpoint",
"enabled": True,
"config": {} # No configurable parameters
}
Features:
Makes actual HTTP requests to upstream API
Automatically added as final interceptor in chain
No user-configurable parameters
6. Progress Tracking Interceptor#
from nemo_evaluator.adapters.interceptors.progress_tracking_interceptor import ProgressTrackingInterceptor
# Configuration
interceptor_config = {
"name": "progress_tracking",
"enabled": True,
"config": {
"progress_tracking_url": "http://localhost:8000",
"progress_tracking_interval": 1,
"request_method": "PATCH",
"output_dir": None
}
}
Features:
Tracks number of samples processed via webhook
Configurable tracking URL and interval
Optional local file tracking
Configurable HTTP request method
7. Payload Modifier Interceptor#
from nemo_evaluator.adapters.interceptors.payload_modifier_interceptor import PayloadParamsModifierInterceptor
# Configuration
interceptor_config = {
"name": "payload_modifier",
"enabled": True,
"config": {
"params_to_remove": None,
"params_to_add": {
"extra_body": {
"chat_template_kwargs": {
"enable_thinking": False
}
}
},
"params_to_rename": None
}
}
Features:
Modifies request payload
Can remove, add, or rename parameters
Supports nested parameter structures
8. Client Error Interceptor#
from nemo_evaluator.adapters.interceptors.raise_client_error_interceptor import RaiseClientErrorInterceptor
# Configuration
interceptor_config = {
"name": "raise_client_errors",
"enabled": True,
"config": {
"exclude_status_codes": [408, 429],
"status_codes": None,
"status_code_range_start": 400,
"status_code_range_end": 499
}
}
Features:
Raises exceptions on client errors (4xx status codes)
Configurable status code ranges
Can exclude specific status codes (like 408, 429)
Stops evaluation on non-retryable errors
Configuration Examples#
Basic Framework Configuration#
framework:
name: mmlu_pro
defaults:
config:
params:
limit_samples: 100
max_tokens: 512
temperature: 0.1
target:
api_endpoint:
adapter_config:
interceptors:
- name: "request_logging"
enabled: true
config:
output_dir: "./logs"
- name: "caching"
enabled: true
config:
cache_dir: "./cache"
Advanced Adapter Configuration#
framework:
name: advanced_eval
defaults:
target:
api_endpoint:
adapter_config:
discovery:
modules: ["custom.interceptors", "my.package"]
dirs: ["/path/to/custom/interceptors"]
interceptors:
- name: "request_logging"
enabled: true
config:
max_requests: 50
log_request_body: true
log_request_headers: true
- name: "caching"
enabled: true
config:
cache_dir: "./cache"
reuse_cached_responses: true
- name: "reasoning"
enabled: true
config:
start_reasoning_token: "<think>"
end_reasoning_token: "</think>"
add_reasoning: true
enable_reasoning_tracking: true
- name: "progress_tracking"
enabled: true
config:
progress_tracking_url: "http://localhost:8000"
progress_tracking_interval: 1
post_eval_hooks:
- name: "custom_analysis"
enabled: true
config:
analysis_type: "detailed"
endpoint_type: "chat"
Interceptor System#
The NeMo Evaluator uses an interceptor-based architecture that processes requests and responses through a configurable chain of components. Interceptors can modify requests, responses, or both, and can be enabled/disabled and configured independently.
Configuration Methods#
You can configure interceptors using two primary approaches:
CLI Overrides: Use the --overrides parameter for runtime configuration
YAML Configuration: Define interceptor chains in configuration files
Configure Interceptors#
Refer to Interceptors for details.
Complete Configuration Example#
Here’s a complete example combining multiple interceptors:
YAML Configuration:
target:
api_endpoint:
adapter_config:
interceptors:
- name: "request_logging"
enabled: true
config:
max_requests: 50
log_request_body: true
log_request_headers: true
- name: "caching"
enabled: true
config:
cache_dir: "./cache"
reuse_cached_responses: true
save_requests: true
save_responses: true
- name: "endpoint"
enabled: true
- name: "response_logging"
enabled: true
config:
max_responses: 50
post_eval_hooks: []
To use the above, save it as config.yaml and run:
eval-factory run_eval \
--eval_type mmlu_pro \
--model_id meta/llama-3.1-8b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir ./results \
--run_config config.yaml
Interceptor Chain Order#
Interceptors are executed in the order they appear in the configuration. The order matters because:
Request interceptors process requests sequentially before they are sent to the endpoint
Response interceptors process responses sequentially after they are received from the endpoint
A typical order is:
system_message - Add/modify system prompts
payload_modifier - Modify request parameters
request_logging - Log the request
caching - Check cache before making request
endpoint - Make the actual API call (automatically added)
response_logging - Log the response
reasoning - Process reasoning tokens
progress_tracking - Track evaluation progress