Evaluations Section#

The evaluations section defines the specific evaluation types available in your framework, each with its own configuration defaults.

Structure#

evaluations:
  - name: example_task_1                    # Evaluation name
    description: Basic functionality demo   # Human-readable description
    defaults:
      config:
        type: "example_task_1"             # Evaluation identifier
        supported_endpoint_types:          # Supported endpoints for this task
          - chat
          - completions
        params:
          task: "example_task_1"           # Task identifier used by the harness
          temperature: 0.0                 # Task-specific temperature
          max_new_tokens: 1024             # Task-specific token limit
          extra:
            custom_key: "custom_value"     #  Task-specific custom param

Fields#

name#

Type: String
Required: Yes

Name for the evaluation type.

Example:

name: HumanEval

description#

Type: String
Required: Yes

Clear description of what the evaluation measures. This helps users understand the purpose and scope of the evaluation.

Example:

description: Evaluates code generation capabilities using the HumanEval benchmark dataset

type#

Type: String
Required: Yes

Unique configuration identifier used by the framework to reference the evaluation in CLI commands and configurations. It typically matches the name field but may differ based on your framework’s conventions.

Example:

type: "humaneval"

supported_endpoint_types#

Type: List of strings
Required: Yes

API endpoint types compatible with this evaluation. List each endpoint type that the evaluation task supports:

  • chat - Conversational format with role-based messages

  • completions - Single-turn text completion

  • vlm - Vision-language model with image support

  • embedding - Embedding generation for retrieval tasks

Example:

supported_endpoint_types:
  - chat
  - completions
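
The same field covers the other endpoint types. The evaluation names and tasks below are illustrative placeholders, not real entries; they only show how a vision-language or embedding evaluation would declare its supported endpoints:

evaluations:
  - name: example_vlm_task
    description: Image-grounded question answering (illustrative placeholder)
    defaults:
      config:
        type: "example_vlm_task"
        supported_endpoint_types:
          - vlm                      # requires image support in the endpoint
        params:
          task: "example_vlm_task"

  - name: example_retrieval_task
    description: Embedding-based retrieval (illustrative placeholder)
    defaults:
      config:
        type: "example_retrieval_task"
        supported_endpoint_types:
          - embedding                # embedding generation for retrieval
        params:
          task: "example_retrieval_task"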

params#

Type: Object
Required: No

Task-specific parameter overrides that differ from the framework-level defaults. Use this to customize settings for individual evaluation types.

Example:

params:
  task: "humaneval"
  temperature: 0.0
  max_new_tokens: 1024
  extra:
    custom_key: "custom_value"

Multiple Evaluations#

You can define multiple evaluation types in a single FDF:

evaluations:
  - name: humaneval
    description: Code generation evaluation
    defaults:
      config:
        type: "humaneval"
        supported_endpoint_types:
          - chat
          - completions
        params:
          task: "humaneval"
          max_new_tokens: 1024

  - name: mbpp
    description: Python programming evaluation
    defaults:
      config:
        type: "mbpp"
        supported_endpoint_types:
          - chat
        params:
          task: "mbpp"
          max_new_tokens: 512

Best Practices#

  • Use descriptive names that indicate the evaluation purpose

  • Provide comprehensive descriptions for each evaluation type

  • List endpoint types that are actually supported and tested

  • Override parameters when they differ from framework defaults

  • Use the extra object for framework-specific custom parameters

  • Group related evaluations together in the same FDF

  • Test each evaluation type with all specified endpoint types