Evaluations Section#
The `evaluations` section defines the specific evaluation types available in your framework, each with its own configuration defaults.
Structure#
evaluations:
  - name: example_task_1                    # Evaluation name
    description: Basic functionality demo   # Human-readable description
    defaults:
      config:
        type: "example_task_1"              # Evaluation identifier
        supported_endpoint_types:           # Supported endpoints for this task
          - chat
          - completions
        params:
          task: "example_task_1"            # Task identifier used by the harness
          temperature: 0.0                  # Task-specific temperature
          max_new_tokens: 1024              # Task-specific token limit
          extra:
            custom_key: "custom_value"      # Task-specific custom param
Fields#
name#
Type: String
Required: Yes
Name for the evaluation type.
Example:
name: HumanEval
description#
Type: String
Required: Yes
Clear description of what the evaluation measures. This helps users understand the purpose and scope of the evaluation.
Example:
description: Evaluates code generation capabilities using the HumanEval benchmark dataset
type#
Type: String
Required: Yes
Unique configuration identifier used by the framework. It is used to reference the evaluation in CLI commands and configuration files. It typically matches the `name` field but may differ based on your framework's conventions.
Example:
type: "humaneval"
supported_endpoint_types#
Type: List of strings
Required: Yes
API endpoint types compatible with this evaluation. List each endpoint type that works with this evaluation task:
- chat: Conversational format with role-based messages
- completions: Single-turn text completion
- vlm: Vision-language model with image support
- embedding: Embedding generation for retrieval tasks
Example:
supported_endpoint_types:
- chat
- completions
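For tasks that use the other endpoint types, the same structure applies; only the listed endpoints change. A minimal sketch of a vision-language evaluation, using hypothetical names:

evaluations:
  - name: example_vlm_task                  # hypothetical evaluation name
    description: Image question answering demo
    defaults:
      config:
        type: "example_vlm_task"
        supported_endpoint_types:
          - vlm                             # vision-language endpoint with image support
        params:
          task: "example_vlm_task"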
params#
Type: Object
Required: No
Task-specific parameter overrides that differ from the framework-level defaults. Use this to customize settings for individual evaluation types.
Example:
params:
  task: "humaneval"
  temperature: 0.0
  max_new_tokens: 1024
  extra:
    custom_key: "custom_value"
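When most settings match the framework-level defaults, list only the values that differ. A minimal sketch with a hypothetical task, assuming unlisted parameters fall back to the framework-level defaults:

params:
  task: "example_task_2"     # hypothetical task identifier
  max_new_tokens: 2048       # the only value that differs from the framework-level defaults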
Multiple Evaluations#
You can define multiple evaluation types in a single FDF:
evaluations:
  - name: humaneval
    description: Code generation evaluation
    defaults:
      config:
        type: "humaneval"
        supported_endpoint_types:
          - chat
          - completions
        params:
          task: "humaneval"
          max_new_tokens: 1024
  - name: mbpp
    description: Python programming evaluation
    defaults:
      config:
        type: "mbpp"
        supported_endpoint_types:
          - chat
        params:
          task: "mbpp"
          max_new_tokens: 512
Best Practices#
- Use descriptive names that indicate the evaluation purpose
- Provide a comprehensive description for each evaluation type
- List only endpoint types that are actually supported and tested
- Override parameters only when they differ from framework defaults
- Use the `extra` object for framework-specific custom parameters (see the sketch after this list)
- Group related evaluations together in the same FDF
- Test each evaluation type with all specified endpoint types
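As a sketch of the `extra` recommendation above, harness-specific settings that have no standard parameter can be carried in the `extra` object; the keys shown here are hypothetical:

params:
  task: "humaneval"
  extra:
    n_samples: 5               # hypothetical harness-specific sampling count
    strip_code_fences: true    # hypothetical post-processing flag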