Custom Evaluation Types

Custom evaluation types provide maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. With custom evaluation, you can bring your own datasets, define your own prompts and templates, and select or implement the metrics that matter most for your use case. This approach is ideal when:

  • You want to evaluate on tasks, data, or formats not covered by academic benchmarks.

  • You need to measure model performance using custom or business-specific criteria.

  • You want to experiment with new evaluation methodologies, metrics, or workflows.

Custom Evaluation Comparison

| Evaluation | Use Case | Metrics | Example |
|---|---|---|---|
| Chat/Completion Tasks | Flexible chat/completion evaluation with custom prompts and metrics | BLEU, string-check, custom metrics | Evaluate Q&A, summarization, or custom chat flows |
| LLM-as-a-Judge | Use an LLM to score outputs for creativity, relevance, and other metrics | LLM-judged scores (custom scale) | Evaluate creative or complex tasks with LLM-based scoring |
| Tool-Calling | Evaluate function/tool call accuracy (OpenAI-compatible) | Tool-calling accuracy | Evaluate function-calling or API tasks |

Prerequisites

Options

Chat/Completion Tasks

Custom chat/completion evaluation lets you assess model performance on conversational or completion-based tasks using your own prompts, templates, and metrics. It suits Q&A, summarization, or any scenario where you want to measure how well a model responds to user inputs beyond standard academic benchmarks. You define the structure of the conversation, specify expected outputs, and use metrics such as BLEU or string-check to measure quality.

Example configuration:

{
  "type": "custom",
  "params": {
    "parallelism": 8
  },
  "tasks": {
    "qa": {
      "type": "completion",
      "params": {
        "template": {
          "messages": [
            {"role": "{{item.messages[0].role}}", "content": "{{item.messages[0].content}}"},
            {"role": "{{item.messages[1].role}}", "content": "{{item.messages[1].content}}"}
          ],
          "max_tokens": 20,
          "temperature": 0.7,
          "top_p": 0.9
        }
      },
      "metrics": {
        "bleu": {
          "type": "bleu",
          "params": {
            "references": ["{{item.messages[2].content | trim}}"]
          }
        },
        "string-check": {
          "type": "string-check",
          "params": {
            "check": [
              "{{item.messages[2].content}}",
              "equals",
              "{{output_text | trim}}"
            ]
          }
        }
      }
    }
  }
}

Example dataset record:

{
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris"},
    {"role": "user", "content": "Correct!"}
  ]
}

Example results:

{
  "tasks": {
    "qa": {
      "metrics": {
        "bleu": {
          "scores": {
            "sentence": {
              "value": 32.3,
              "stats": {
                "count": 200,
                "sum": 6460.66,
                "mean": 32.3
              }
            },
            "corpus": {
              "value": 14.0
            }
          }
        },
        "string-check": {
          "scores": {
            "string-check": {
              "value": 0.255,
              "stats": {
                "count": 200,
                "sum": 51.0,
                "mean": 0.255
              }
            }
          }
        }
      }
    }
  }
}
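
As a rough illustration of how the `stats` block in the results relates to per-sample scores, the sketch below applies an equality check to trimmed outputs and aggregates the resulting 0/1 scores into `count`, `sum`, and `mean`. The function names `string_check_equals` and `aggregate` are hypothetical and are not part of the evaluation service, whose actual metric implementation may differ.

```python
from statistics import fmean


def string_check_equals(reference: str, output_text: str) -> float:
    """Hypothetical equality check: 1.0 when the trimmed output equals the reference."""
    return 1.0 if reference == output_text.strip() else 0.0


def aggregate(scores: list[float]) -> dict:
    """Summarize per-sample scores in the same shape as the `stats` block."""
    return {"count": len(scores), "sum": sum(scores), "mean": fmean(scores)}


# Illustrative (reference, model output) pairs; not real evaluation data.
pairs = [("Paris", " Paris\n"), ("Paris", "Paris."), ("4", "4"), ("4", "four")]
scores = [string_check_equals(ref, out) for ref, out in pairs]
stats = aggregate(scores)
print(stats)
```

Read the same way, the string-check mean of 0.255 over 200 samples in the example results corresponds to 51 exact matches.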

LLM-as-a-Judge

LLM-as-a-Judge uses another LLM to evaluate model outputs against scoring criteria that you define. It is suited to creative or complex tasks and can be adapted for domain-specific evaluations. Only single-mode evaluation is supported; pairwise model comparisons are not.

Example configuration:

{
    "type": "custom",
    "name": "my-configuration-llm-judge-1",
    "namespace": "my-organization",
    "tasks": {
        "my-task": {
            "type": "custom",
            "metrics": {
                "accuracy": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": "<my-judge-nim-url>",
                                "model_id": "<my-judge-model-id>"
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": "Your task is to evaluate the semantic similarity between two responses."
                                },
                                {
                                    "role": "user",
                                    "content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
                                }
                            ]
                        },
                        "scores": {
                            "similarity": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": "SIMILARITY: (\\d+)"
                                }
                            }
                        }
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Example dataset record:

{
  "reference_answer": "Paris",
  "output_text": "Paris"
}

Example results:

{
  "tasks": {
    "my-task": {
      "metrics": {
        "accuracy": {
          "scores": {
            "similarity": {
              "value": 7
            }
          }
        }
      }
    }
  }
}
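
The sketch below shows how a regex score parser of this kind might turn the judge's reply into an integer. It assumes the parser behaves like Python's `re.search` and reads the first capture group, which the source does not confirm; `parse_int_score` is a hypothetical helper. It also shows why the capture group matters on a 0 to 10 scale: a single-digit group such as `(\d)` would read a score of 10 as 1.

```python
import re
from typing import Optional


def parse_int_score(judge_text: str, pattern: str) -> Optional[int]:
    """Return the first captured group as an int, or None when nothing matches."""
    match = re.search(pattern, judge_text)
    return int(match.group(1)) if match else None


reply = "SIMILARITY: 10"
multi_digit = parse_int_score(reply, r"SIMILARITY: (\d+)")   # captures all digits
single_digit = parse_int_score(reply, r"SIMILARITY: (\d)")   # captures one digit only
missing = parse_int_score("no score here", r"SIMILARITY: (\d+)")
print(multi_digit, single_digit, missing)
```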

Download Results

The following Python script downloads the generated results from NeMo Data Store:

import huggingface_hub as hh

# Replace the placeholders with the values for your deployment.
url = "<NeMo Data Store URL>"
token = "mock"
repo_name = "<evaluation id>"
download_path = "<Path where results will be downloaded>"

# The results repository is addressed as nvidia/<evaluation id>.
repo_id = f"nvidia/{repo_name}"

api = hh.HfApi(endpoint=url, token=token)
api.snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    local_dir=download_path,
    local_dir_use_symlinks=False,
)

Results Directory Structure

results/
├── mt_bench/
│   ├── model_answer/           # User LLM responses for each prompt
│   │   └── <llm_name>.jsonl
│   ├── model_judgment/         # Judge LLM ratings and explanations for each answer
│   │   └── <llm_name>.jsonl
│   ├── reference_answer/       # Reference/context for each question (for custom judging)
│   │   └── <reference>.jsonl
│   ├── question.jsonl          # List of questions used in the evaluation
│   └── judge_prompts.jsonl     # Prompts used for the judge LLM
└── results/
    └── <llm_name>.csv         # Aggregated scores for each evaluation category
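
Once the results are downloaded, the JSONL files in the tree above can be read with the standard library. The following is a minimal sketch: `load_jsonl` and `load_judgments` are hypothetical helper names, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_judgments(results_dir: Path) -> dict[str, list[dict]]:
    """Map each <llm_name> to the records in mt_bench/model_judgment/<llm_name>.jsonl."""
    judgment_dir = results_dir / "mt_bench" / "model_judgment"
    return {path.stem: load_jsonl(path) for path in sorted(judgment_dir.glob("*.jsonl"))}
```

For example, `load_judgments(Path(download_path) / "results")` would return one list of judgment records per evaluated model, assuming the download preserves the layout shown above.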

Aggregated Scores CSV

| Category | Description |
|---|---|
| total | Average score across all categories |
| humanities | Score for humanities-related questions |
| reasoning | Score for reasoning-related questions |
| writing | Score for writing-related questions |
| coding | Score for coding-related questions |
| stem | Score for STEM-related questions |
| roleplay | Score for roleplay-related questions |
| math | Score for math-related questions |
| extraction | Score for extraction-related questions |
| turn 1 | Average score for the first turn |
| turn 2 | Average score for the second turn |

  • Each row in the CSV describes a score from 1 to 10, where 1 is the weakest and 10 is the strongest for a given category.

  • For custom evaluations, categories will match those in your custom dataset.

Judge LLM Output Format

The Judge LLM must provide ratings in the format [[rating]]. If the required format is not followed, a warning will appear in the .csv file. Adjust inference parameters or use a different Judge LLM if needed.
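
To check whether a judge reply follows the `[[rating]]` format before running a full evaluation, a small pattern match can help. This is only a sketch: `extract_rating` is a hypothetical helper, and the service's own parsing rules may be stricter or looser than this regular expression.

```python
import re
from typing import Optional

# Matches an integer or decimal number wrapped in double square brackets.
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def extract_rating(judge_text: str) -> Optional[float]:
    """Return the first [[rating]] value in the judge output, or None if absent."""
    match = RATING_PATTERN.search(judge_text)
    return float(match.group(1)) if match else None


well_formed = extract_rating("The answer is accurate. Rating: [[8]]")
malformed = extract_rating("The answer is accurate. Rating: 8")
print(well_formed, malformed)
```

A `None` result here corresponds to the case where the judge did not follow the required format.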

Custom Dataset Format

question.jsonl

For LLM-as-a-judge, the question.jsonl file contains questions to be evaluated by the LLM judge. Each line in this file represents a single question with its metadata, including a unique identifier, category, and the conversation turns.

| Field | Type | Required | Description |
|---|---|---|---|
| question_id | integer | Yes | Unique identifier for the question. |
| category | string | Yes | Category or topic of the question (e.g., ‘math’, ‘general’). |
| turns | list of strings | Yes | List of user turns (questions or conversation turns). For single-turn, use a single-element list. |

{"question_id": 1, "category": "general", "turns": ["What is the capital of France?"]}
{"question_id": 2, "category": "math", "turns": ["What is 2+2?"]}

judge_prompts.jsonl

For LLM-as-a-judge, the judge_prompts.jsonl file contains the prompt templates used by the LLM judge to evaluate model responses. Each line in this file represents a different prompt configuration with system instructions and templates.

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Name of the prompt template (e.g., ‘single-v1’). |
| type | string | Yes | Type of prompt (e.g., ‘single’). |
| system_prompt | string | Yes | System message for the judge LLM (instructions for judging). |
| prompt_template | string | Yes | Template for the user prompt, with placeholders for question, answer, etc. |
| description | string | No | Description of the prompt’s intended use. |
| category | string or list of strings | No | Category or categories this prompt applies to. |
| output_format | string | Yes | Required output format for the judge LLM (e.g., ‘[[rating]]’). |

{"name": "single-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": "general", "output_format": "[[rating]]"}
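
To preview what the judge will see, the `{question}` and `{answer}` placeholders can be filled locally. The sketch below assumes the placeholders follow Python `str.format` conventions, which the source does not state; `build_judge_messages` is a hypothetical helper and the prompt is shortened for illustration.

```python
import json

# One line of judge_prompts.jsonl, shortened for illustration.
line = json.dumps({
    "name": "single-v1",
    "type": "single",
    "system_prompt": "You are a helpful assistant.",
    "prompt_template": (
        "[Question]\n{question}\n\n"
        "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]"
    ),
    "output_format": "[[rating]]",
})


def build_judge_messages(prompt: dict, question: str, answer: str) -> list[dict]:
    """Fill the template placeholders and return chat messages for the judge."""
    user_content = prompt["prompt_template"].format(question=question, answer=answer)
    return [
        {"role": "system", "content": prompt["system_prompt"]},
        {"role": "user", "content": user_content},
    ]


messages = build_judge_messages(json.loads(line), "What is the capital of France?", "Paris")
print(messages[1]["content"])
```

Under this assumption, a template containing literal curly braces other than the placeholders would need them doubled (`{{` and `}}`) to survive formatting.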

reference.jsonl

For LLM-as-a-judge, the reference.jsonl file contains reference answers or ground truth for questions. This file is optional but useful for evaluations where you want to compare model responses against known correct answers.

| Field | Type | Required | Description |
|---|---|---|---|
| question_id | integer or string | Yes | The question_id this reference is associated with. |
| choices | list of objects | Yes | List of reference answers or context objects. Each object typically has an index and a turns list (reference answers or context for each turn). |

{"question_id": 1, "choices": [{"index": 0, "turns": ["Paris"]}]}
{"question_id": 2, "choices": [{"index": 0, "turns": ["4"]}]}
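
Before uploading a custom dataset, it can be useful to confirm that every reference lines up with a question. The sketch below pairs the two example files by `question_id`; the helper names are hypothetical, and the IDs are compared as strings because a reference `question_id` may be an integer or a string.

```python
import json

questions_jsonl = """\
{"question_id": 1, "category": "general", "turns": ["What is the capital of France?"]}
{"question_id": 2, "category": "math", "turns": ["What is 2+2?"]}
"""
references_jsonl = """\
{"question_id": 1, "choices": [{"index": 0, "turns": ["Paris"]}]}
{"question_id": 2, "choices": [{"index": 0, "turns": ["4"]}]}
"""


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text, ignoring blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def reference_turns_by_question(references: list[dict]) -> dict[str, list[str]]:
    """Index the first choice's reference turns by question_id (as a string)."""
    return {str(ref["question_id"]): ref["choices"][0]["turns"] for ref in references}


questions = parse_jsonl(questions_jsonl)
lookup = reference_turns_by_question(parse_jsonl(references_jsonl))

for question in questions:
    # lookup.get returns None when a question has no reference answer.
    reference = lookup.get(str(question["question_id"]))
    print(question["turns"], "->", reference)
```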

Tool-Calling

Tool-calling evaluation measures the accuracy of function/tool calls by comparing the calls a model produces against ground-truth calls. It supports the OpenAI-compatible function calling format.

Example configuration:

{
    "type": "custom",
    "name": "my-configuration-tool-calling-1",
    "namespace": "my-organization",
    "tasks": {
        "my-tool-calling-task": {
            "type": "custom",
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {
                        "tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Example dataset record:

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [{"name": "book_table", "args": {"people": 2, "time": "7pm"}}]}
  ],
  "tool_calls": [
    {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
  ]
}

Example results:

{
  "tasks": {
    "my-tool-calling-task": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 1.0
            },
            "function_name_and_args_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
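
The two scores in the results can be pictured as two levels of strictness when comparing predicted calls with ground truth. The sketch below is one plausible reading, not the service's implementation: it assumes calls are compared position by position and that a sample scores 1.0 only when every call matches. The helper names are hypothetical, and the source does not specify order sensitivity or partial credit.

```python
def call_matches(predicted: dict, expected: dict, check_args: bool) -> bool:
    """Compare one predicted call with one ground-truth call."""
    if predicted.get("name") != expected.get("name"):
        return False
    return not check_args or predicted.get("args") == expected.get("args")


def sample_score(predicted: list[dict], expected: list[dict], check_args: bool) -> float:
    """Score 1.0 only when every call matches position by position (assumed rule)."""
    if len(predicted) != len(expected):
        return 0.0
    return float(all(call_matches(p, e, check_args) for p, e in zip(predicted, expected)))


ground_truth = [{"name": "book_table", "args": {"people": 2, "time": "7pm"}}]
wrong_args = [{"name": "book_table", "args": {"people": 4, "time": "7pm"}}]

name_only = sample_score(wrong_args, ground_truth, check_args=False)
name_and_args = sample_score(wrong_args, ground_truth, check_args=True)
print(name_only, name_and_args)
```

Under these assumptions, a call with the right function but wrong arguments would count toward `function_name_accuracy` but not toward `function_name_and_args_accuracy`.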