Custom Evaluations#

Besides the standard evaluation types, you can run custom evaluation jobs in NVIDIA NeMo Evaluator.

Evaluate a Model on Custom Datasets#

You can run a custom evaluation job to evaluate a model on custom datasets by comparing the LLM-generated response with a ground truth response. For a tutorial that walks you through the steps to evaluate the Llama 3.1 8B Instruct model as an example, refer to Customize the Evaluation Loop.

Data for Custom Evaluations#

The input dataset is configured at the task level and can be specified directly within the configuration of the evaluation job.

You must ensure all fields that you intend to use for prompt and scoring templates have been added to your dataset before using it in a custom evaluation. The following example depicts a basic CSV dataset.

"question","answer","reference_answer"
What is 2+?,4, The answer is 4
Square root of 256?,16, The answer is 16
What power of 2 is 1024? ,10, The answer is 10

After you have formatted your file and verified that it includes all the fields you intend to use, save the file and upload it to a dataset in NeMo Data Store. Follow the steps at Use Custom Data with NVIDIA NeMo Evaluator.

Input Dataset Formats#

The input dataset can be formatted in the following ways:

  • CSV — Comma-separated values format (not supported for custom tool calling evaluation).

  • JSON — An array of JSON objects (default format).

  • JSONL — JSON Lines format, where each line is a separate JSON object.
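For example, the CSV dataset shown earlier can equivalently be provided as JSONL, with one JSON object per line:

{"question": "What is 2+2?", "answer": "4", "reference_answer": "The answer is 4"}
{"question": "Square root of 256?", "answer": "16", "reference_answer": "The answer is 16"}
{"question": "What power of 2 is 1024?", "answer": "10", "reference_answer": "The answer is 10"}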

Templating for Tasks#

When rendering templates, two default objects are available:

- **item** — Represents the current item from the dataset.
- **sample** — Contains data related to the output from the model. The `sample.output_text` represents the completion text for completion models and the content of the first message for chat models.

The properties on the `item` object are derived from the dataset's column names (for CSV files) or keys (for JSON and JSONL files).
The following rules apply to these properties:

- All non-alphanumeric characters are replaced with underscores.
- Column names are converted to lowercase.
- In case of conflicts, suffixes (`_1`, `_2`, and so on) are appended to the property names.
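For intuition, the following Python sketch approximates these rules. It is an illustration only, not the service's actual implementation:

import re

def to_item_property(column_name: str) -> str:
    """Approximate how a column name becomes an item property name."""
    name = column_name.lower()              # names are lowercased
    return re.sub(r"[^0-9a-z]", "_", name)  # non-alphanumeric characters become underscores

def resolve_conflicts(columns: list[str]) -> dict[str, str]:
    """Append _1, _2, ... when two columns normalize to the same property name."""
    seen: dict[str, int] = {}
    mapping: dict[str, str] = {}
    for column in columns:
        prop = to_item_property(column)
        if prop in seen:
            seen[prop] += 1
            prop = f"{prop}_{seen[prop]}"
        else:
            seen[prop] = 0
        mapping[column] = prop
    return mapping

print(resolve_conflicts(["Question", "Reference Answer", "reference-answer"]))
# {'Question': 'question', 'Reference Answer': 'reference_answer', 'reference-answer': 'reference_answer_1'}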

Templates for Chat Models#

Prompt templates structure the task that is sent to the model. For chat-completion tasks, the template follows the NIM/OpenAI message format. Templates use the Jinja2 templating syntax, and variables are written with double curly braces, for example `{{item.review}}`.

Example Template for Chat-Completion Task

In the case of a chat-completion task, the template should follow the NIM/OpenAI format. An example structure is provided below:

{
    "messages": [{
        "role": "system",
        "content": "You are an expert in analyzing the sentiment of movie reviews."
    }, {
        "role": "user",
        "content": "Determine if the following review is positive or negative: {{item.review}}"
    }]
}

Simple Chat Templating#

When you evaluate a model that has been fine-tuned with message-formatted data, or when you evaluate an endpoint and you want to use the message format, you must configure your Jinja template based on your custom data.

If your custom data contains `prompt` and `ideal_response` fields, you can structure the request as a single-turn chat.

{ 
    "messages": [{
        "role": "system", 
        "content": "You are an expert in analyzing the sentiment of movie reviews."
    }, { 
        "role": "user", 
        "content": "Determine if the following review is positive or negative: {{item.prompt}}"
    }] 
} 

You can include this in a call to a /chat/completions endpoint.

  {
    "config": {
      "type": "custom",
      "tasks": {
        "qa": {
          "type": "completion",
          "params": {
            "template": {
              "messages": [{
                "role": "system",
                "content": "You are a helpful, respectful and honest assistant. \nExtract from the following context the minimal span word for word that best answers the question.\n."
              }, { 
                "role": "user",
                "content": "Context: {{item.prompt}}"
              }] 
            }
          },
          "metrics": {
            "accuracy": {
              "type": "string-check",
              "params": {
                "check": [
                  "{{sample.output_text}}",
                  "contains",
                  "{{item.ideal_response}}"
                ]
              }
            }
          },
          "dataset": {
            "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
          }
        }
      }
    },
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://<my-nim-url>/chat/completions",
          "model_id": "<my-model-id>"
        }
      }
    }
  }

Messages Data Template#

If your custom data is already message-formatted, you can configure your template as follows.

{
    "messages": "{{ item.messages | tojson }}"
}
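The `| tojson` filter serializes the item's messages into a JSON string when the template is rendered. You can preview how the filter behaves locally with the jinja2 Python package; this is only a local approximation, not the evaluator's own rendering:

from jinja2 import Template  # pip install jinja2

item = {"messages": [{"role": "user", "content": "What is 2+2?"}]}
rendered = Template("{{ item.messages | tojson }}").render(item=item)
print(rendered)
# [{"role": "user", "content": "What is 2+2?"}]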

Evaluating with Similarity Metrics#

Custom evaluation with the `bleu` and `string-check` metrics is ideal for cases where the LLM generations are not expected to be highly creative. For more information, refer to Similarity Metrics.

String-Check Metric#

The string-check metric performs a simple string comparison between a reference text and the output text. The supported operations are `equals`, `not equals`, `contains`, `not contains`, `startswith`, and `endswith`. The operation is carried out between two templates that can use the same variables as the task templates.

A single score is provided as the output by the string-check metric:

  • The score is 1 if the check is successful.

  • The score is 0 otherwise.

Example Configuration with String Check Metric

{
    "type": "string-check",
    "params": {
        "check": [
            "{{sample.output_text}}",
            "equals",
            "{{item.label}}"
        ]
    }
}
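Conceptually, the three-element check array is evaluated as left operand, operation, right operand after the templates are rendered. The following Python sketch approximates the semantics; it is an illustration, not the service's implementation:

# Rough approximation of the string-check operations
def string_check(left: str, operation: str, right: str) -> int:
    ops = {
        "equals": lambda a, b: a == b,
        "not equals": lambda a, b: a != b,
        "contains": lambda a, b: b in a,
        "not contains": lambda a, b: b not in a,
        "startswith": lambda a, b: a.startswith(b),
        "endswith": lambda a, b: a.endswith(b),
    }
    return 1 if ops[operation](left, right) else 0

print(string_check("The answer is 4", "contains", "4"))  # 1
print(string_check("The answer is 4", "equals", "4"))    # 0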

BLEU Metric#

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human. For more information, refer to Wikipedia and IBM Documentation.

The `bleu` metric for custom evaluation requires the `references` parameter to be set, pointing to where the ground truths are located in the input dataset.

Example Configuration with BLEU Metric

{
  "bleu": {
    "type": "bleu",
    "params": {
      "references": [
        "{{reference_answer}}"
      ]
    }
  }
}
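For intuition about the metric itself, you can compute BLEU locally, for example with the sacrebleu package. This is only for exploring the metric; the evaluator computes its own sentence-level and corpus-level scores:

import sacrebleu  # pip install sacrebleu

hypothesis = "The answer is 4"
references = ["The answer is 4"]

score = sacrebleu.sentence_bleu(hypothesis, references)
print(score.score)  # 100.0 for an exact match; lower as overlap with the reference decreases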

Evaluation With LLM-as-a-Judge#

With LLM-as-a-Judge, an LLM can be evaluated by using another LLM as a judge. LLM-as-a-Judge can be used for any use case, even highly generative ones, but the choice of judge is crucial in getting reliable evaluations. For more information, refer to LLM-as-a-Judge.

The llm-judge metric is complex and involves prompting the LLM with a specific task, then parsing the output to extract a score. This metric allows you to specify the model to use, the template for the task, and the parser for the output.

Use the parser field to specify the type of parser to use. Currently, only regex is supported. The parser extracts the score from the output text based on a specified pattern:

  • The first group in the pattern is used as the value for the score.

If you don’t specify a parser, the output text is used as the score, and is converted to int or float.

Configuration#

The following is an example configuration for a custom evaluation with the LLM Judge metric.

{
    "type": "llm-judge",
    "params": {
        "model": "model-name",
        "task_template": {
            "messages": [{
                "role": "system",
                "content": "You are an expert in evaluating the quality of responses."
            }, {
                "role": "user",
                "content": "Please evaluate the following response: {{sample.output_text}}"
            }]
        },
        "parser": {
            "type": "regex",
            "pattern": "Rating: (\\d+)"
        }
    }
}
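For example, if the judge replies with text such as "The response is accurate. Rating: 8", the regex parser above extracts the first capture group as the score. A quick Python illustration of that parsing step (an approximation of the behavior, not the service code):

import re

judge_output = "The response is accurate. Rating: 8"
match = re.search(r"Rating: (\d+)", judge_output)
score = int(match.group(1)) if match else None  # the first group becomes the score
print(score)  # 8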

Example#

The following example launches an evaluation job that performs scoring with the string-check, BLEU, and LLM-judge metrics. The job is shown first as a curl request and then as the equivalent Python call.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
{
    "namespace": "my-organization",
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://<my-nim-url>/completions",
          "model_id": "<my-model-id>"
        }
      }
    },
    "config": {
      "type": "custom",
      "tasks": {
        "qa": {
          "type": "completion",
          "params": {
            "template": {
              "prompt": "Answer very briefly (no explanation) this question: {{question}}.\nAnswer: ",
              "max_tokens": 30
            }
          },
          "metrics": {
            "accuracy": {
              "type": "string-check",
              "params": {
                "check": [
                  "{{sample.output_text}}",
                  "contains",
                  "{{item.answer}}"
                ]
              }
            },
            "bleu": {
              "type": "bleu",
              "params": {
                "references": [
                  "{{reference_answer}}"
                ]
              }
            },
            "accuracy-2": {
              "type": "llm-judge",
              "params": {
                "model": {
                  "api_endpoint": {
                    "url": "https://<my-judge-nim-url>",
                    "model_id": "<my-judge-model-id>"
                  }
                },
                "template": {
                  "messages": [
                    {
                      "role": "system",
                      "content": "Your task is to evaluate the semantic similarity between two responses."
                    },
                    {
                      "role": "user",
                      "content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
                    }
                  ]
                },
                "scores": {
                  "similarity": {
                    "type": "int",
                    "parser": {
                      "type": "regex",
                      "pattern": "SIMILARITY: (\\d)"
                    }
                  }
                }
              }
            }
          },
          "dataset": {
            "files_url": "hf://datasets/default/<my-math-dataset>"
          }
        }
      }
    }
}'
The following Python code creates the same job using the requests library.

import requests

EVALUATOR_HOSTNAME = "<evaluator-hostname>"  # set to your Evaluator hostname

data = {
    "namespace": "my-organization",
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://<my-nim-url>/completions",
          "model_id": "<my-model-id>"
        }
      }
    },
    "config": {
      "type": "custom",
      "tasks": {
        "qa": {
          "type": "completion",
          "params": {
            "template": {
              "prompt": "Answer very briefly (no explanation) this question: {{question}}.\nAnswer: ",
              "max_tokens": 30
            }
          },
          "metrics": {
            "accuracy": {
              "type": "string-check",
              "params": {
                "check": [
                  "{{sample.output_text}}",
                  "contains",
                  "{{item.answer}}"
                ]
              }
            },
            "bleu": {
              "type": "bleu",
              "params": {
                "references": [
                  "{{reference_answer}}"
                ]
              }
            },
            "accuracy-2": {
              "type": "llm-judge",
              "params": {
                "model": {
                  "api_endpoint": {
                    "url": "https://<my-judge-nim-url>",
                    "model_id": "<my-judge-model-id>"
                  }
                },
                "template": {
                  "messages": [
                    {
                      "role": "system",
                      "content": "Your task is to evaluate the semantic similarity between two responses."
                    },
                    {
                      "role": "user",
                      "content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
                    }
                  ]
                },
                "scores": {
                  "similarity": {
                    "type": "int",
                    "parser": {
                      "type": "regex",
                      "pattern": "SIMILARITY: (\\d)"
                    }
                  }
                }
              }
            }
          },
          "dataset": {
            "files_url": "hf://datasets/default/<my-math-dataset>"
          }
        }
      }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

# Make the API call
response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")

To see a sample response, refer to Create Job Response.

Results#

After the evaluation job completes, go to the /v1/evaluation/jobs/{job_id}/results endpoint to see the results and the metric scores.
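For example, you can retrieve the results with the requests library, reusing the EVALUATOR_HOSTNAME and job_id values from the job creation snippet above (a sketch under those assumptions):

import requests

results = requests.get(
    f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/{job_id}/results"
).json()
print(results["tasks"]["qa"]["metrics"]["accuracy"]["scores"])

A response similar to the following is returned.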

{
  "created_at": "2025-03-08T01:12:40.986445",
  "updated_at": "2025-03-08T01:12:40.986445",
  "id": "evaluation_result-P7SA6X9b5DP9xuYXzTMvo8",
  "job": "eval-EgWCHn3bosKsr5jGiKPT9w",
  "tasks": {
    "qa": {
      "metrics": {
        "accuracy": {
          "scores": {
            "string-check": {
              "value": 0.833333333333333,
              "stats": {
                "count": 6,
                "sum": 5,
                "mean": 0.833333333333333
              }
            }
          }
        },
        "bleu": {
          "scores": {
            "sentence": {
              "value": 1.39085669840765,
              "stats": {
                "count": 6,
                "sum": 8.34514019044593,
                "mean": 1.39085669840765
              }
            },
            "corpus": {
              "value": 0
            }
          }
        },
        "accuracy-2": {
          "scores": {
            "similarity": {
              "value": 2.5,
              "stats": {
                "count": 6,
                "sum": 15,
                "mean": 2.5
              }
            }
          }
        }
      }
    }
  },
  "groups": {
  },
  "namespace": "default",
  "custom_fields": {
  }
}

Custom Tool Calling Evaluation#

Custom evaluation lets you evaluate models for tool calling capabilities.

The custom tool calling evaluation only supports evaluation of OpenAI-compatible model APIs and requires the input data (both inputs and ground truths) to be formatted in the OpenAI format for tool calling. For more information, refer to Function calling.

The custom tool calling evaluation supports only the chat completions API, which means the task type can only be chat-completion (see the full example below).

Parameters#

For tool calling evaluation, at a minimum, you need to provide messages and tools in the evaluation config parameters.

  • messages — the chat messages to pass to the model. Typically you store the messages as JSON in the dataset and reference them through item, which represents the current row of the dataset. For example, "{{ item.messages | tojson }}" uses the messages JSON object from each row. For any JSON structure used here, make sure to add | tojson so that the template renders valid JSON.

  • tools — the list of tools the model can use. This must be formatted in the OpenAI-compatible tools format. Typically you store the tools as JSON in the dataset and reference them through item in the same way, for example "{{ item.tools | tojson }}". As with messages, add | tojson so that the template renders valid JSON.

  • tool_choice — Optional setting that controls whether, and which, tools the model calls. For more information, refer to Tool choice.
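Together, these parameters appear in the task's params template, as in the full example later in this section:

    "params": {
        "template": {
            "messages": "{{ item.messages | tojson }}",
            "tools": "{{ item.tools | tojson }}",
            "tool_choice": "auto"
        }
    }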

Warning

Currently, we don’t support functions with more than 8 parameters.

Metrics#

To use the tool calling metric, add a metric with type tool-calling.

The tool calling metric requires the "tool_calls_ground_truth" metric param, which points to where the ground truth tool calls are located in the input dataset.

    "metrics": {
        "my-tool-calling-accuracy": {
            "type": "tool-calling",
            "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
        }
    }

The tool_calls ground truth must be an array of objects, each containing a "function" with a name and arguments (similar to an OpenAI-compatible tool_calls response, except that the arguments are decoded JSON rather than a string).

[{
    "function": {
        "name": "calculate_triangle_area",
        "arguments": {
            "base": 10,
            "height": 5,
            "unit": "units"
        }
    }
}]

The metric produces the following scores:

  • function_name_accuracy — the accuracy of tool choice (using function name). The metric score is case sensitive and order insensitive.

  • function_name_and_args_accuracy — the accuracy of tool choice (using function name) and arguments. The metric score is case sensitive and order sensitive (for args order).

Warning

Custom evaluation for tool calling is currently limited to the following scenarios:

  • simple (single-turn requests)

  • parallel (parallel function calling, with the function calls unordered)

Dataset#

The following is an example dataset for a custom tool calling evaluation.

[
    {
        "messages": [
            {
                "role": "user",
                "content": "Find the area of a triangle with a base of 10 units and height of 5 units."
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "calculate_triangle_area",
                    "description": "Calculate the area of a triangle given its base and height.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "base": {
                                "type": "integer",
                                "description": "The base of the triangle."
                            },
                            "height": {
                                "type": "integer",
                                "description": "The height of the triangle."
                            },
                            "unit": {
                                "type": "string",
                                "description": "The unit of measure (defaults to \"units\" if not specified)"
                            }
                        },
                        "required": [
                            "base",
                            "height"
                        ]
                    }
                }
            }
        ],
        "tool_calls": [
            {
                "function": {
                    "name": "calculate_triangle_area",
                    "arguments": {
                        "base": 10,
                        "height": 5,
                        "unit": "units"
                    }
                }
            }
        ]
    }
]

Example#

The following example launches a custom evaluation job for tool calling. The job is shown first as a curl request and then as the equivalent Python call.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
{
    "namespace": "my-organization",
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://<my-nim-url>/completions",
          "model_id": "<my-model-id>"
        }
      }
    },
    "config": {
        "type": "custom",
        "tasks": {
            "my-custom-tool-calling": {
                "type": "chat-completion",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "params": {
                    "template": {
                        "messages": "{{ item.messages | tojson}}",
                        "tools": "{{ item.tools | tojson }}",
                        "tool_choice": "auto"
                    }
                },
                "metrics": {
                    "tool-calling-accuracy": {
                        "type": "tool-calling",
                        "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                    }
                }
            }
        }
    }
}'
The following Python code creates the same job using the requests library.

import requests

EVALUATOR_HOSTNAME = "<evaluator-hostname>"  # set to your Evaluator hostname

data = {
    "namespace": "my-organization",
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://<my-nim-url>/completions",
          "model_id": "<my-model-id>"
        }
      }
    },
    "config": {
        "type": "custom",
        "tasks": {
            "my-custom-tool-calling": {
                "type": "chat-completion",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "params": {
                    "template": {
                        "messages": "{{ item.messages | tojson}}",
                        "tools": "{{ item.tools | tojson }}",
                        "tool_choice": "auto"
                    }
                },
                "metrics": {
                    "tool-calling-accuracy": {
                        "type": "tool-calling",
                        "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                    }
                }
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

# Make the API call
response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")

To see a sample response, refer to Create Job Response.

Results#

After the evaluation job completes, go to the /v1/evaluation/jobs/{job_id}/results endpoint to see the results and the metric scores.

{
  "created_at": "2025-03-13T23:51:01.551970",
  "updated_at": "2025-03-13T23:51:01.551972",
  "id": "evaluation_result-KCC511Pq4WqpyUVbkXK5Ks",
  "job": "eval-tKCnL6PiWDo7CabN3p2GH",
  "tasks": {
    "custom-tool-calling": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 0.3475,
              "stats": {
                "count": 400,
                "sum": 139,
                "mean": 0.3475
              }
            },
            "function_args_accuracy": {
              "value": 0.1625,
              "stats": {
                "count": 400,
                "sum": 65,
                "mean": 0.1625
              }
            },
            "function_name_and_args_accuracy": {
              "value": 0.1625,
              "stats": {
                "count": 400,
                "sum": 65,
                "mean": 0.1625
              }
            }
          }
        }
      }
    }
  },
  "groups": {

  },
  "namespace": "default",
  "custom_fields": {

  }
}

Output Format and Structure#

The output of the evaluation task is a dataset that includes the model’s response and any computed metrics. The dataset is designed to facilitate the analysis of the model’s performance by matching inputs with outputs and providing relevant metrics.

Output Dataset Structure

The output dataset contains at least two mandatory columns:

  • id — A unique identifier used to match the output with the corresponding input from the dataset.

  • output_text — The output generated by the model.

ID Column

  • If the input dataset contains an id column, it is used to match the input with the output.

  • If no id column is present in the input dataset, the row number is used as the ID.

For each computed metric at the sample level, a new column is added to the output dataset. The column name corresponds to the metric name, and the value is the computed metric value.
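For illustration only, a row of the output dataset for the QA example above might look like the following; the exact serialization can differ by deployment and metric configuration:

{
    "id": 0,
    "output_text": "The answer is 4",
    "accuracy": 1
}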

The output file is named `results.json` and is stored at the location specified by `EvaluationJob.output_files_url`. If `output_files_url` is not provided and the data store microservice is not available, the output file is discarded.