Custom Evaluation Types#

Custom evaluation types provide maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. With custom evaluation, you can bring your own datasets, define your own prompts and templates, and select or implement the metrics that matter most for your use case. This approach is ideal when:

You want to evaluate on tasks, data, or formats not covered by academic benchmarks.
You need to measure model performance using custom or business-specific criteria.
You want to experiment with new evaluation methodologies, metrics, or workflows.

Custom Evaluation Comparison#
Evaluation	Use Case	Metrics	Example
Chat/Completion Tasks	Flexible chat/completion evaluation with custom prompts and metrics	BLEU, string-check, custom metrics	Evaluate Q&A, summarization, or custom chat flows
LLM-as-a-Judge	Use an LLM to score outputs for creativity, relevance, and other metrics.	LLM-judged scores (custom scale)	Evaluate creative or complex tasks with LLM-based scoring
Tool-Calling	Evaluate function/tool call accuracy (OpenAI-compatible)	Tool-calling accuracy	Evaluate function-calling or API tasks

Prerequisites#

Set up or select an existing evaluation target.

Chat/Completion Tasks#

Custom chat/completion evaluation allows you to assess model performance on flexible conversational or completion-based tasks using your own prompts, templates, and metrics. This is ideal for Q&A, summarization, or any scenario where you want to evaluate how well a model generates responses to user inputs, beyond standard academic benchmarks. You can define the structure of the conversation, specify expected outputs, and use metrics like BLEU or string-check to measure quality.

Config

{
  "type": "custom",
  "params": {
    "parallelism": 8
  },
  "tasks": {
    "qa": {
      "type": "chat-completion",
      "params": {
        "template": {
          "messages": [
            {"role": "user", "content": "{{item.question}}"},
            {"role": "assistant", "content": "{{item.answer}}"}
          ],
          "max_tokens": 20,
          "temperature": 0.7,
          "top_p": 0.9
        }
      },
      "metrics": {
        "bleu": {
          "type": "bleu",
          "params": {
            "references": ["{{item.reference_answer | trim}}"]
          }
        },
        "rouge": {
          "type": "rouge",
          "params": {
            "ground_truth": "{{item.reference_answer | trim}}"
          }
        },
        "string-check": {
          "type": "string-check",
          "params": {
            "check": [
              "{{item.reference_answer | trim}}",
              "equals",
              "{{output_text | trim}}"
            ]
          }
        },
        "f1": {
            "type": "f1",
            "params": {
                "ground_truth": "{{item.reference_answer | trim}}"
            }
        },
        "em": {
            "type": "em",
            "params": {
                "ground_truth": "{{item.reference_answer | trim}}"
            }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/default/<my-dataset>"
      }
    }
  }
}

Data Format

"question","answer","reference_answer"
"What is the capital of France?","Paris","The answer is Paris"
"What is 2+2?","4","The answer is 4"
"Square root of 256?","16","The answer is 16"

Result

{
  "tasks": {
    "qa": {
      "metrics": {
        "bleu": {
          "scores": {
            "sentence": {
              "value": 32.3,
              "stats": {
                "count": 200,
                "sum": 6460.66,
                "mean": 32.3
              }
            },
            "corpus": {
              "value": 14.0
            }
          }
        },
        "rouge": {
          "scores": {
            "rouge_1_score": {
              "value": 0.238671638808714,
              "stats": {
                "count": 10,
                "sum": 2.38671638808714,
                "mean": 0.238671638808714
              }
            },
            "rouge_2_score": {
              "value": 0.14953146173038,
              "stats": {
                "count": 10,
                "sum": 1.4953146173038,
                "mean": 0.14953146173038
              }
            },
            "rouge_3_score": {
              "value": 0.118334587614537,
              "stats": {
                "count": 10,
                "sum": 1.18334587614537,
                "mean": 0.118334587614537
              }
            },
            "rouge_L_score": {
              "value": 0.198059156106409,
              "stats": {
                "count": 10,
                "sum": 1.98059156106409,
                "mean": 0.198059156106409
              }
            }
          }
        },
        "string-check": {
          "scores": {
            "string-check": {
              "value": 0.255,
              "stats": {
                "count": 200,
                "sum": 51.0,
                "mean": 0.255
              }
            }
          }
        },
        "f1": {
          "scores": {
            "f1_score": {
              "value": 0.226293156870275,
              "stats": {
                "count": 10,
                "sum": 2.26293156870275,
                "mean": 0.226293156870275
              }
            }
          }
        },
        "em": {
          "scores": {
            "em_score": {
              "value": 0,
              "stats": {
                "count": 10,
                "sum": 0,
                "mean": 0
              }
            }
          }
        }
      }
    }
  }
}

LLM-as-a-Judge#

Use another LLM to evaluate outputs. Flexible scoring criteria. Suitable for creative or complex tasks. Can be adapted for domain-specific evaluations. Does not support pairwise model comparisons; only single-mode evaluation. For controlling reasoning behavior with judge models, see Advanced Reasoning.

Task Types for LLM-as-a-Judge#

LLM-as-a-Judge can be used with three different task types:

data - Evaluate existing prompt/response pairs directly (no model inference needed). Use this when you already have model outputs and want to judge them.
completion - Generate completions from a target model, then evaluate with an LLM judge. Use this for completion-style tasks where you want to prompt a model and then judge the outputs.
chat-completion - Generate chat responses from a target model, then evaluate with an LLM judge. Use this for conversational tasks where you want to prompt a model in chat format and then judge the responses.

Choose data when you already have model outputs to evaluate, or completion/chat-completion when you need to generate new outputs from a target model first.

Data Task Type

Use when you have existing prompt/response pairs to evaluate directly.

{
    "type": "custom",
    "name": "my-configuration-llm-judge-data",
    "namespace": "my-organization",
    "tasks": {
        "my-data-task": {
            "type": "data",
            "metrics": {
                "accuracy": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": "<my-judge-nim-url>",
                                "model_id": "<my-judge-model-id>"
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": "Your task is to evaluate the semantic similarity between two responses."
                                },
                                {
                                    "role": "user",
                                    "content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{item.model_output}}.\n\n"
                                }
                            ]
                        },
                        "scores": {
                            "similarity": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": "SIMILARITY: (\\d+)"
                                }
                            }
                        }
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Completion Task Type

Use when you want to generate completions from a target model and then judge them.

{
    "type": "custom",
    "name": "my-configuration-llm-judge-completion",
    "namespace": "my-organization",
    "tasks": {
        "my-completion-task": {
            "type": "completion",
            "params": {
                "template": "Answer this question: {{item.question}}\nAnswer:"
            },
            "metrics": {
                "quality": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": "<my-judge-nim-url>",
                                "model_id": "<my-judge-model-id>"
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": "Your task is to evaluate the quality of an answer to a question."
                                },
                                {
                                    "role": "user",
                                    "content": "Rate the quality from 1-10. Format: QUALITY: X\n\nQUESTION: {{item.question}}\nANSWER: {{output_text}}\nEXPECTED: {{item.expected_answer}}"
                                }
                            ]
                        },
                        "scores": {
                            "quality": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": "QUALITY: (\\d+)"
                                }
                            }
                        }
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Chat-Completion Task Type

Use when you want to generate chat responses from a target model and then judge them.

{
    "type": "custom",
    "name": "my-configuration-llm-judge-chat",
    "namespace": "my-organization",
    "tasks": {
        "my-chat-task": {
            "type": "chat-completion",
            "params": {
                "template": {
                    "messages": [
                        {
                            "role": "system",
                            "content": "You are a helpful assistant."
                        },
                        {
                            "role": "user",
                            "content": "{{item.user_message}}"
                        }
                    ]
                }
            },
            "metrics": {
                "helpfulness": {
                    "type": "llm-judge",
                    "params": {
                        "model": {
                            "api_endpoint": {
                                "url": "<my-judge-nim-url>",
                                "model_id": "<my-judge-model-id>"
                            }
                        },
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": "Your task is to evaluate how helpful an assistant's response is."
                                },
                                {
                                    "role": "user",
                                    "content": "Rate helpfulness from 1-5. Format: HELPFUL: X\n\nUSER: {{item.user_message}}\nASSISTANT: {{output_text}}"
                                }
                            ]
                        },
                        "scores": {
                            "helpfulness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": "HELPFUL: (\\d)"
                                }
                            }
                        }
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Data Format Examples

For Data Task Type:

{
  "reference_answer": "Paris",
  "model_output": "The capital of France is Paris"
}

For Completion Task Type:

{
  "question": "What is the capital of France?",
  "expected_answer": "Paris"
}

For Chat-Completion Task Type:

{
  "user_message": "What is the capital of France?"
}

Result Examples

Data Task Type Result:

{
  "tasks": {
    "my-data-task": {
      "metrics": {
        "accuracy": {
          "scores": {
            "similarity": {
              "value": 8,
              "stats": {
                "count": 100,
                "mean": 7.5,
                "min": 3,
                "max": 10
              }
            }
          }
        }
      }
    }
  }
}

Completion Task Type Result:

{
  "tasks": {
    "my-completion-task": {
      "metrics": {
        "quality": {
          "scores": {
            "quality": {
              "value": 8,
              "stats": {
                "count": 50,
                "mean": 7.2,
                "min": 4,
                "max": 10
              }
            }
          }
        }
      }
    }
  }
}

Chat-Completion Task Type Result:

{
  "tasks": {
    "my-chat-task": {
      "metrics": {
        "helpfulness": {
          "scores": {
            "helpfulness": {
              "value": 4,
              "stats": {
                "count": 75,
                "mean": 4.1,
                "min": 2,
                "max": 5
              }
            }
          }
        }
      }
    }
  }
}

Download Results#

The following Python script can be used to download the generated results:

import huggingface_hub as hh
import requests

url = "<NeMo Data Store URL>"
token = "mock"
repo_name = "<evaluation id>"
download_path = "<Path where results will be downloaded>"

repo_name = f'nvidia/{repo_name}'

api = hh.HfApi(endpoint=url, token=token)
repo_type = 'dataset'
api.snapshot_download(repo_id=repo_name, repo_type=repo_type, local_dir=download_path, local_dir_use_symlinks=False)

Results Directory Structure#

results/
├── results.yml                 # Main evaluation results file
├── requests.json               # Detailed request logs for LLM-as-a-Judge calls
└── metadata.json               # Job metadata and configuration

Results Files Description#

results.yml: Contains the main evaluation results in the standardized format with tasks, metrics, and scores
requests.json: Contains detailed logs of all LLM requests and responses made during the evaluation, useful for debugging and analysis
metadata.json: Contains job configuration, timestamps, and other metadata about the evaluation run

Judge LLM Output Format

The Judge LLM must provide ratings in the format [[rating]]. If the required format is not followed, a warning will appear in the .csv file. Adjust inference parameters or use a different Judge LLM if needed.

Custom Dataset Format#

question.jsonl#

Fields

For LLM-as-a-judge, the question.jsonl file contains questions to be evaluated by the LLM judge. Each line in this file represents a single question with its metadata, including a unique identifier, category, and the conversation turns.

Field	Type	Required	Description
`question_id`	integer	Yes	Unique identifier for the question.
`category`	string	Yes	Category or topic of the question (e.g., ‘math’, ‘general’).
`turns`	list of strings	Yes	List of user turns (questions or conversation turns). For single-turn, use a single-element list.

Example

{"question_id": 1, "category": "general", "turns": ["What is the capital of France?"]}
{"question_id": 2, "category": "math", "turns": ["What is 2+2?"]}

judge_prompts.jsonl#

Fields

For LLM-as-a-judge, the judge_prompts.jsonl file contains the prompt templates used by the LLM judge to evaluate model responses. Each line in this file represents a different prompt configuration with system instructions and templates.

Field	Type	Required	Description
`name`	string	Yes	Name of the prompt template (e.g., ‘single-v1’).
`type`	string	Yes	Type of prompt (e.g., ‘single’).
`system_prompt`	string	Yes	System message for the judge LLM (instructions for judging).
`prompt_template`	string	Yes	Template for the user prompt, with placeholders for question, answer, etc.
`description`	string	No	Description of the prompt’s intended use.
`category`	string or list of strings	No	Category or categories this prompt applies to.
`output_format`	string	Yes	Required output format for the judge LLM (e.g., ‘[[rating]]’).

Example

{"name": "single-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": "general", "output_format": "[[rating]]"}

reference.jsonl#

Fields

For LLM-as-a-judge, the reference.jsonl file contains reference answers or ground truth for questions. This file is optional but useful for evaluations where you want to compare model responses against known correct answers.

Field	Type	Required	Description
`question_id`	integer or string	Yes	The question_id this reference is associated with.
`choices`	list of objects	Yes	List of reference answers or context objects. Each object typically has an `index` and a `turns` list (reference answers or context for each turn).

Example

{"question_id": 1, "choices": [{"index": 0, "turns": ["Paris"]}]}
{"question_id": 2, "choices": [{"index": 0, "turns": ["4"]}]}

Tool-Calling#

Evaluate accuracy of function/tool calls. Compare against ground truth calls. Supports OpenAI-compatible function calling format.

Config

{
    "type": "custom",
    "name": "my-configuration-tool-calling-1",
    "namespace": "my-organization",
    "tasks": {
        "my-tool-calling-task": {
            "type": "chat-completion",
            "params": {
                "template": {
                    "messages": [
                        {"role": "user", "content": "{{item.messages[0].content}}"}
                    ]
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {
                        "tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Data Format

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [{"name": "book_table", "args": {"people": 2, "time": "7pm"}}]}
  ],
  "tool_calls": [
    {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
  ]
}

Result

{
  "tasks": {
    "my-tool-calling-task": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 1.0
            },
            "function_name_and_args_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}