Defaults Section
The defaults section defines the default configuration and the execution command used across all evaluations unless overridden. Overrides are supported either through the --overrides flag (refer to Parameter Overrides) or through a Run Configuration.
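For example, dotted key paths can be overridden at launch time without editing the YAML (a sketch; refer to Parameter Overrides for the exact syntax):

--overrides 'config.params.limit_samples=10,config.params.temperature=0.5'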
Command Template
The command field uses Jinja2 templating to dynamically generate execution commands based on configuration parameters.
defaults:
  command: >-
    {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
    example_eval --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}}
    --url {{target.api_endpoint.url}}
    --temperature {{config.params.temperature}}
    # ... additional parameters
Important Note: example_eval is a placeholder for your actual CLI command. When onboarding your harness, replace it with the real command (e.g., lm-eval, bigcode-eval, or gorilla-eval).
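For illustration, suppose a hypothetical configuration sets api_key to MY_API_KEY, model_id to meta/llama-3.1-8b-instruct, task to mmlu, url to http://localhost:8000/v1/chat/completions, and temperature to 0.0. The folded template above would then render to a single command line:

export API_KEY=$MY_API_KEY && example_eval --model meta/llama-3.1-8b-instruct --task mmlu --url http://localhost:8000/v1/chat/completions --temperature 0.0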
Template Variables
Target API Endpoint Variables
{{target.api_endpoint.api_key}}: Name of the environment variable that stores the API key
{{target.api_endpoint.model_id}}: Target model identifier
{{target.api_endpoint.stream}}: Whether responses should be streamed
{{target.api_endpoint.type}}: Type of the target endpoint
{{target.api_endpoint.url}}: URL of the model
{{target.api_endpoint.adapter_config}}: Adapter configuration
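These variables are populated from the target section of the run configuration; a hypothetical api_endpoint block might look like this (all values are illustrative):

target:
  api_endpoint:
    api_key: MY_API_KEY   # name of the environment variable holding the key
    model_id: meta/llama-3.1-8b-instruct
    stream: false
    type: chat
    url: http://localhost:8000/v1/chat/completions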
Evaluation Configuration Variables
{{config.output_dir}}: Output directory for results
{{config.type}}: Type of the task
{{config.supported_endpoint_types}}: Supported endpoint types (chat/completions)
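These variables are consumed in the command template in the same way as the endpoint variables; for example, a harness that accepts an output-directory flag might append (the flag name here is an assumption about your CLI):

--output_dir {{config.output_dir}}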
Configuration Parameters
{{config.params.task}}: Evaluation task type
{{config.params.temperature}}: Model temperature setting
{{config.params.limit_samples}}: Sample limit for the evaluation
{{config.params.max_new_tokens}}: Maximum tokens to generate
{{config.params.max_retries}}: Number of REST request retries
{{config.params.parallelism}}: Parallelism to use
{{config.params.request_timeout}}: REST response timeout
{{config.params.top_p}}: Top-p sampling parameter
{{config.params.extra}}: Framework-specific parameters
Configuration Defaults
The following example shows common parameter defaults. Each framework defines its own default values in the framework.yml file.
defaults:
  config:
    params:
      limit_samples: null       # No limit on samples by default
      max_new_tokens: 4096      # Maximum tokens to generate
      temperature: 0.0          # Deterministic generation
      top_p: 0.00001            # Nucleus sampling parameter
      parallelism: 10           # Number of parallel requests
      max_retries: 5            # Maximum API retry attempts
      request_timeout: 60       # Request timeout in seconds
      extra:                    # Framework-specific parameters
        n_samples: null         # Number of sampled responses per input
        downsampling_ratio: null  # Data downsampling ratio
        add_system_prompt: false  # Include system prompt
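A Run Configuration can override any of these defaults by re-declaring the same keys; a minimal sketch (values are illustrative):

config:
  params:
    limit_samples: 100   # evaluate only the first 100 samples
    temperature: 0.7     # switch from deterministic to sampled generation
    parallelism: 32      # increase request concurrency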
Parameter Categories
Core Parameters
Basic evaluation settings that control model behavior:
temperature: Controls randomness in generation (0.0 = deterministic)
max_new_tokens: Maximum length of the generated output
top_p: Nucleus sampling parameter for diversity
Performance Parameters
Settings that affect execution speed and reliability:
parallelism: Number of parallel API requests
request_timeout: Maximum wait time for an API response
max_retries: Number of retry attempts for failed requests
Framework Parameters
Task-specific configuration options:
task: Specific evaluation task to run
limit_samples: Limit the number of samples for testing
Extra Parameters
Custom parameters specific to your framework (a sketch follows the list below). Use the extra block for:
specifying number of sampled responses per input query
judge configuration
configuring few-shot settings
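A hypothetical extra block combining these uses (the key names are illustrative, not a fixed schema):

config:
  params:
    extra:
      n_samples: 5        # sampled responses per input query
      n_shots: 3          # few-shot examples per prompt
      judge:              # judge configuration
        model_id: my-judge-model
        url: http://localhost:8001/v1/chat/completions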
Target Configuration
defaults:
  target:
    api_endpoint:
      type: chat                  # Default endpoint type
      supported_endpoint_types:   # All supported types
        - chat
        - completions
        - vlm
        - embedding
Endpoint Types
chat: Multi-turn conversation format following the OpenAI chat completions API (/v1/chat/completions). Use this for models that support conversational interactions with role-based messages (system, user, assistant).
completions: Single-turn text completion format following the OpenAI completions API (/v1/completions). Use this for models that generate text based on a single prompt without conversation context. Often used for log-probability evaluations.
vlm: Vision-language model endpoints that support image inputs alongside text (/v1/chat/completions). Use this for multimodal evaluations that include visual content.
embedding: Embedding generation endpoints for retrieval and similarity evaluations (/v1/embeddings). Use this for tasks that require vector representations of text.
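For example, a log-probability evaluation could point its target at a completions endpoint (a sketch; all values are illustrative):

target:
  api_endpoint:
    type: completions
    model_id: my-base-model
    url: http://localhost:8000/v1/completions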