Evaluations Section#
The `evaluations` section defines the specific evaluation types available in your framework, each with its own configuration defaults.
Structure#
evaluations:
  - name: example_task_1                    # Evaluation name
    description: Basic functionality demo   # Human-readable description
    defaults:
      config:
        type: "example_task_1"              # Evaluation identifier
        supported_endpoint_types:           # Supported endpoints for this task
          - chat
          - completions
        params:
          task: "example_task_1"            # Task identifier used by the harness
          temperature: 0.0                  # Task-specific temperature
          max_new_tokens: 1024              # Task-specific token limit
          extra:
            custom_key: "custom_value"      # Task-specific custom param
Fields#
name#
Type: String
Required: Yes
Name for the evaluation type.
Example:
name: HumanEval
description#
Type: String
Required: Yes
Clear description of what the evaluation measures. This helps users understand the purpose and scope of the evaluation.
Example:
description: Evaluates code generation capabilities using the HumanEval benchmark dataset
type#
Type: String
Required: Yes
Unique configuration identifier used by the framework. It is used to reference the evaluation in CLI commands and configuration files. It typically matches the `name` field but may differ based on your framework's conventions.
Example:
type: "humaneval"
supported_endpoint_types#
Type: List of strings
Required: Yes
API endpoint types compatible with this evaluation. List each endpoint type that works with this evaluation task:
- chat: Conversational format with role-based messages
- completions: Single-turn text completion
- vlm: Vision-language model with image support
- embedding: Embedding generation for retrieval tasks
Example:
supported_endpoint_types:
- chat
- completions
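For tasks that use the other endpoint types, the same structure applies; only the listed endpoints change. A minimal sketch of a vision-language evaluation, using hypothetical names:

evaluations:
  - name: example_vlm_task                  # hypothetical evaluation name
    description: Image question answering demo
    defaults:
      config:
        type: "example_vlm_task"
        supported_endpoint_types:
          - vlm                             # vision-language endpoint with image support
        params:
          task: "example_vlm_task"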
params#
Type: Object
Required: No
Task-specific parameter overrides that differ from the framework-level defaults. Use this to customize settings for individual evaluation types.
Example:
params:
  task: "humaneval"
  temperature: 0.0
  max_new_tokens: 1024
  extra:
    custom_key: "custom_value"
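When most settings match the framework-level defaults, list only the values that differ. A minimal sketch with a hypothetical task, assuming unlisted parameters fall back to the framework-level defaults:

params:
  task: "example_task_2"     # hypothetical task identifier
  max_new_tokens: 2048       # the only value that differs from the framework-level defaults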
Multiple Evaluations#
You can define multiple evaluation types in a single FDF:
evaluations:
  - name: humaneval
    description: Code generation evaluation
    defaults:
      config:
        type: "humaneval"
        supported_endpoint_types:
          - chat
          - completions
        params:
          task: "humaneval"
          max_new_tokens: 1024
  - name: mbpp
    description: Python programming evaluation
    defaults:
      config:
        type: "mbpp"
        supported_endpoint_types:
          - chat
        params:
          task: "mbpp"
          max_new_tokens: 512
Best Practices#
- Use descriptive names that indicate the evaluation purpose
- Provide a comprehensive description for each evaluation type
- List only endpoint types that are actually supported and tested
- Override parameters only when they differ from framework defaults
- Use the `extra` object for framework-specific custom parameters (see the sketch after this list)
- Group related evaluations together in the same FDF
- Test each evaluation type with all specified endpoint types
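As a sketch of the `extra` recommendation above, harness-specific settings that have no standard parameter can be carried in the `extra` object; the keys shown here are hypothetical:

params:
  task: "humaneval"
  extra:
    n_samples: 5               # hypothetical harness-specific sampling count
    strip_code_fences: true    # hypothetical post-processing flag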