Run a Simple Evaluation#

Learn how to perform a simple evaluation by creating an evaluation target, creating an evaluation configuration, and then creating and monitoring an evaluation job.

Note

The time to complete this tutorial is approximately 25 minutes. In this tutorial, you run an evaluation job; for more information on evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

Before you begin this tutorial, complete the following steps:

  1. Install NeMo Evaluator. For more information, see Deploy the NeMo Evaluator Microservice.

  2. The URL of the evaluator API depends on where you deploy Evaluator and how you configure it. Store the service URL to use it in your code.

    Important

    Replace <your evaluator service endpoint> with your full service URL including the appropriate protocol (http:// or https://) before you run this code.

    export EVALUATOR_SERVICE_URL="http(s)://<your evaluator service endpoint>"
    
    import requests
    
    EVALUATOR_SERVICE_URL = "http(s)://<your evaluator service endpoint>" 
    
  3. Perform a health check to ensure that the service is available. If everything is working properly, the check returns a response of healthy. A scripted variant of this check is shown after this list.

    curl -X "GET" "${EVALUATOR_SERVICE_URL}/health" \
    -H 'accept: application/json'
    
    endpoint = f"{EVALUATOR_SERVICE_URL}/health"
    response = requests.get(endpoint).json()
    response
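
If you run these steps from a script rather than an interactive shell or notebook, you can read the service URL from the environment variable you exported above and fail fast when the service is unreachable. The following is a minimal sketch; the exact body of the health response depends on your deployment.

import os
import requests

# Read the URL exported in the previous step instead of hardcoding it.
EVALUATOR_SERVICE_URL = os.environ["EVALUATOR_SERVICE_URL"]

# Fail fast if the Evaluator service is unreachable or returns an error status.
response = requests.get(f"{EVALUATOR_SERVICE_URL}/health", headers={"accept": "application/json"})
response.raise_for_status()
print(response.json())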
    

1. Create a Target#

First, create an evaluation target that meets the LLM Model requirements. Each target is uniquely identified by a combination of namespace and name. For example, my-organization/my-target.

Use the following code to create your target. Make sure that you specify the actual NIM service URL in the request body (don’t use ${NIM_SERVICE_URL}, as it doesn’t automatically resolve).

curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/targets" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "model",
        "name": "my-model-target-1",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                "url": "${NIM_SERVICE_URL}/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct",
                "format": "openai"
            }
        }
    }'
data = {
   "type": "model",
   "name": "my-model-target-1",
   "namespace": "my-organization",
   "model": {
      "api_endpoint": {
         "url": "${NIM_SERVICE_URL}/v1/chat/completions",
         "model_id": "meta/llama-3.1-8b-instruct",
         "format": "openai"
      }
   }
}

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/targets"

response = requests.post(endpoint, json=data).json()
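
As noted above, the request body must contain the resolved NIM service URL rather than the literal ${NIM_SERVICE_URL} placeholder. If you keep the NIM URL in an environment variable, the following sketch shows one way to build and send the request body in Python; it assumes you exported NIM_SERVICE_URL in the same shell session.

import os
import requests

# Resolve the NIM service URL from the environment so the literal placeholder never reaches the API.
nim_service_url = os.environ["NIM_SERVICE_URL"]

data = {
    "type": "model",
    "name": "my-model-target-1",
    "namespace": "my-organization",
    "model": {
        "api_endpoint": {
            "url": f"{nim_service_url}/v1/chat/completions",  # resolved URL, not "${NIM_SERVICE_URL}"
            "model_id": "meta/llama-3.1-8b-instruct",
            "format": "openai"
        }
    }
}

response = requests.post(f"{EVALUATOR_SERVICE_URL}/v1/evaluation/targets", json=data)
response.raise_for_status()  # fail fast if the target was not created
print(response.json())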

Note

Chat Templates

When you use chat templates for evaluation (apply_chat_template: true in the evaluation config, as shown in the example below), do the following:

  1. Use the chat completions endpoint (/v1/chat/completions) in the target configuration for the API endpoint URL of the model.

  2. Set format: "openai" in the target configuration for the model API endpoint.

To see a sample response, refer to Create Evaluation Target.

Tip

If you want to evaluate a model’s ability to show step-by-step reasoning, make sure the model was fine-tuned or configured with detailed thinking on in the system message. For direct answers, use detailed thinking off. Refer to Reasoning Considerations for details on how to set this during dataset preparation.

Tip

To pass a system prompt (such as enabling detailed reasoning) to Llama Nemotron models during evaluation, set the system_instruction field in your configuration’s params.extra section. Refer to Passing System Prompts for details and examples.
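
As a rough illustration of the preceding Tip, the following partial sketch shows where a system_instruction field sits inside params.extra when you create a configuration. The prompt text ("detailed thinking on") and the configuration name are assumptions for illustration only; refer to Passing System Prompts for the authoritative examples.

# Partial sketch only: illustrates the placement of `system_instruction` inside `params.extra`.
data = {
    "type": "gsm8k",
    "name": "my-gsm8k-config-reasoning",  # hypothetical name
    "namespace": "my-organization",
    "params": {
        "extra": {
            "system_instruction": "detailed thinking on"  # assumed prompt text; model-specific
        }
    }
}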

2. Create a Configuration#

Next, you create an evaluation configuration. For this example, you create an LM Evaluation Harness configuration.

Important

When using chat templates (apply_chat_template: true), you must also configure the HuggingFace tokenizer:

  1. Set tokenizer_backend: "hf" to use the HuggingFace tokenizer.

  2. Set tokenizer to the appropriate HuggingFace model ID (for example, meta-llama/Llama-3.1-8B-Instruct).

  3. Set hf_token to your HuggingFace access token (required for most model tokenizers).

Use the following code to create your configuration.

Important

Each configuration is uniquely identified by a combination of namespace and name. For example, my-organization/my-configuration.

curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gsm8k",
        "name": "my-gsm8k-config-1",
        "namespace": "my-organization",
        "params": {
            "temperature": 0.00001,      
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "extra": {
                "num_fewshot": 8,
                "batch_size": 16,
                "bootstrap_iters": 100000,
                "dataset_seed": 42,
                "use_greedy": true,
                "top_k": 1,
                "apply_chat_template": true,
                "fewshot_as_multiturn": true,
                "hf_token": "<your-hf-token>",
                "tokenizer_backend": "hf",
                "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
            }
        }
    }'
data = {
    "type": "gsm8k",
    "name": "my-gsm8k-config-1",
    "namespace": "my-organization",
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "apply_chat_template": True,
            "fewshot_as_multiturn": True,
            "hf_token": "<your-hf-token>",
            "tokenizer_backend": "hf", 
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
        }
    }
}

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

3. Create and Launch a Job#

Finally, you create an evaluation job that uses the target and configuration that you created in the previous steps. When you call the API to create the job, the evaluation starts.

Important

Each job is uniquely identified by a job_id that the Evaluator service creates. For example, eval-1234ABCD5678EFGH.

Use the following code to create and launch your job.

curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "namespace": "my-organization",
        "target": "my-organization/my-model-target-1",
        "config": "my-organization/my-gsm8k-config-1"
    }'
data = {
   "namespace": "my-organization",
   "target": "my-organization/my-model-target-1",
   "config": "my-organization/my-gsm8k-config-1"
}

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs"

response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later

job_id = response['id']
print(f"Job ID: {job_id}")

# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.

job_status = response['status']
print(f"Job status: {job_status}")

To see a sample response and more information about the response format, refer to API.

4. Get the Status of Your Evaluation Job#

To get the status of the evaluation job that you submitted in the previous step, use the following code. In the curl command, replace <job-id> with the job ID that you saved in the previous step.

Note

For non-custom evaluations like LM Evaluation Harness, the progress percentage in the API may not update regularly. However, as long as the status is running, the evaluation is proceeding. The time to complete depends on factors like model performance and dataset size.

curl -X "GET" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>/status" \
  -H 'accept: application/json'
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>/status"
response = requests.get(endpoint).json()
response

You receive a response similar to the following, which contains the status of each task and the percentage of progress completed. For more information, refer to Get Evaluation Job Status.

{
  "message": "completed",
  "task_status": {
  },
  "progress": 100
}
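
If you script the evaluation end to end, you can poll the status endpoint until the job reaches a terminal state. The following is a minimal sketch; the terminal values checked here ("completed", "failed", "cancelled") and the polling interval are assumptions that you may need to adjust for your deployment.

import time
import requests

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/{job_id}/status"

while True:
    status = requests.get(endpoint, headers={"accept": "application/json"}).json()
    message = status.get("message", "")
    print(f"Status: {message}, progress: {status.get('progress')}")
    if message in ("completed", "failed", "cancelled"):  # assumed terminal states
        break
    time.sleep(30)  # poll every 30 seconds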

Next Steps#