Run a Simple Evaluation#
Learn how to perform a simple evaluation using the following components:
Evaluation Target: LLM Model
Evaluation Type: LM Evaluation Harness
Note
The time to complete this tutorial is approximately 25 minutes. In this tutorial, you run an evaluation job. For more information on evaluation job duration, refer to Expected Evaluation Duration.
Prerequisites#
Before you can use this documentation, do the following:
Install NeMo Evaluator. For more information, see Deploy the NeMo Evaluator Microservice.
The URL of the evaluator API depends on where you deploy Evaluator and how you configure it. Store the service URL to use it in your code.
Important
Replace <your evaluator service endpoint> with your full service URL, including the appropriate protocol (http:// or https://), before you run this code.
export EVALUATOR_SERVICE_URL="http(s)://<your evaluator service endpoint>"
import requests

EVALUATOR_SERVICE_URL = "http(s)://<your evaluator service endpoint>"
Perform a health check to ensure that the service is available. If everything is working properly, it should return a response of healthy.
curl -X "GET" "${EVALUATOR_SERVICE_URL}/health" \
  -H 'accept: application/json'
endpoint = f"{EVALUATOR_SERVICE_URL}/health"
response = requests.get(endpoint).json()
response
1. Create a Target#
First, create an evaluation target that meets the LLM Model requirements. Each target is uniquely identified by a combination of namespace and name, for example my-organization/my-target.
Use the following code to create your target. Make sure that you specify the actual NIM service URL in the request body (don’t use ${NIM_SERVICE_URL}, as it doesn’t automatically resolve).
curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/targets" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
  "type": "model",
  "name": "my-model-target-1",
  "namespace": "my-organization",
  "model": {
    "api_endpoint": {
      "url": "${NIM_SERVICE_URL}/v1/chat/completions",
      "model_id": "meta/llama-3.1-8b-instruct",
      "format": "openai"
    }
  }
}'
data = {
    "type": "model",
    "name": "my-model-target-1",
    "namespace": "my-organization",
    "model": {
        "api_endpoint": {
            "url": "${NIM_SERVICE_URL}/v1/chat/completions",
            "model_id": "meta/llama-3.1-8b-instruct",
            "format": "openai"
        }
    }
}
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/targets"
response = requests.post(endpoint, json=data).json()
Note
Chat Templates
When you use chat templates for evaluation (apply_chat_template: true is set in the evaluation config; see the example below), do the following:
- Use the chat completions endpoint (/v1/chat/completions) as the API endpoint URL of the model in the target configuration.
- Set format: "openai" for the model API endpoint in the target configuration.
To see a sample response, refer to Create Evaluation Target.
Tip
If you want to evaluate a model’s ability to show step-by-step reasoning, make sure the model was fine-tuned or configured with detailed thinking on in the system message. For direct answers, use detailed thinking off. Refer to Reasoning Considerations for details on how to set this during dataset preparation.
Tip
To pass a system prompt (such as enabling detailed reasoning) to Llama Nemotron models during evaluation, set the system_instruction field in your configuration’s params.extra section. Refer to Passing System Prompts for details and examples.
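For illustration, the following minimal Python sketch shows where the field goes. The variable name and prompt text are examples only, and the sketch assumes a model that honors detailed thinking on; merge the extra fields into the configuration that you create in the next step.
# Illustrative sketch only: a params fragment that passes a system prompt
# through params.extra.system_instruction. The prompt value is an example and
# assumes a Llama Nemotron model that honors "detailed thinking on".
params_with_system_prompt = {
    "extra": {
        "system_instruction": "detailed thinking on"
    }
}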
2. Create a Configuration#
Next, you create an evaluation configuration. For this example, you create an LM Evaluation Harness configuration.
Important
When using chat templates (apply_chat_template: true), you must also configure the HuggingFace tokenizer:
- Set tokenizer_backend: "hf" to use the HuggingFace tokenizer.
- Set tokenizer to the appropriate HuggingFace model ID (for example, "meta-llama/Llama-3.1-8B-Instruct").
- Set hf_token to your HuggingFace access token (required for most model tokenizers).
Use the following code to create your configuration.
Important
Each configuration is uniquely identified by a combination of namespace and name, for example my-organization/my-configuration.
curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
  "type": "gsm8k",
  "name": "my-gsm8k-config-1",
  "namespace": "my-organization",
  "params": {
    "temperature": 0.00001,
    "top_p": 0.00001,
    "max_tokens": 256,
    "stop": ["<|eot|>"],
    "extra": {
      "num_fewshot": 8,
      "batch_size": 16,
      "bootstrap_iters": 100000,
      "dataset_seed": 42,
      "use_greedy": true,
      "top_k": 1,
      "apply_chat_template": true,
      "fewshot_as_multiturn": true,
      "hf_token": "<your-hf-token>",
      "tokenizer_backend": "hf",
      "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
    }
  }
}'
data = {
    "type": "gsm8k",
    "name": "my-gsm8k-config-1",
    "namespace": "my-organization",
    "params": {
        "temperature": 0.00001,
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "apply_chat_template": True,
            "fewshot_as_multiturn": True,
            "hf_token": "<your-hf-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
        }
    }
}
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
3. Create and Launch a Job#
Finally, you create an evaluation job that uses the target and configuration that you created in the previous steps. When you call the API to create the job, the evaluation starts.
Important
Each job is uniquely identified by a job_id that the Evaluator service creates, for example eval-1234ABCD5678EFGH.
Use the following code to create and launch your job.
curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
  "namespace": "my-organization",
  "target": "my-organization/my-model-target-1",
  "config": "my-organization/my-gsm8k-config-1"
}'
data = {
    "namespace": "my-organization",
    "target": "my-organization/my-model-target-1",
    "config": "my-organization/my-gsm8k-config-1"
}
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs"
response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")

# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.
job_status = response['status']
print(f"Job status: {job_status}")
For a sample response and more information about the response format, refer to API.
4. Get the Status of Your Evaluation Job#
To get the status of the evaluation job that you submitted in the previous step, use the following code.
Note
For non-custom evaluations like LM Evaluation Harness, the progress percentage in the API may not update regularly. However, as long as the status is running, the evaluation is proceeding. The time to complete depends on factors like model performance and dataset size.
curl -X "GET" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>/status" \
-H 'accept: application/json'
# Use the job_id captured when you created the job
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/{job_id}/status"
response = requests.get(endpoint).json()
response
You receive a response similar to the following, which contains the status of each task and the percentage of progress completed. For more information, refer to Get Evaluation Job Status.
{
  "message": "completed",
  "task_status": {},
  "progress": 100
}
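If you prefer to wait for the job programmatically, the following minimal Python sketch polls the status endpoint until it reports a terminal state. It reuses the job_id variable captured earlier and assumes that completed and failed are the terminal values of the message field; adjust these to match your deployment.
import time
import requests

# Minimal polling sketch: check the job status every 30 seconds until it finishes.
# Assumption: "completed" and "failed" are terminal values of the "message" field.
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/{job_id}/status"
while True:
    status = requests.get(endpoint).json()
    print(f"message={status.get('message')} progress={status.get('progress')}")
    if status.get("message") in ("completed", "failed"):
        break
    time.sleep(30)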
Next Steps#
To view and download evaluation results, see Use the Results of Your Job.
To learn how to bring your own data, see Using Custom Data.
To learn about supported evaluation types, see Evaluation Types.
For the full API reference, see Evaluator API.