Run a Simple Evaluation#

Learn how to perform a simple evaluation by creating an evaluation target, creating an evaluation configuration, and then creating and monitoring an evaluation job.

Note

The time to complete this tutorial is approximately 25 minutes. In this tutorial, you run an evaluation job; for more information on evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

Before you begin this tutorial, complete the following steps:

  1. Install NeMo Evaluator. For more information, see Deploy the NeMo Evaluator Microservice.

  2. The URL of the evaluator API depends on where you deploy Evaluator and how you configure it. Store the service URL to use it in your code.

    Important

    Replace <your evaluator service endpoint> with your full service URL including the appropriate protocol (http:// or https://) before you run this code.

    export EVALUATOR_SERVICE_URL="http(s)://<your evaluator service endpoint>"
    
    import requests
    
    EVALUATOR_SERVICE_URL = "http(s)://<your evaluator service endpoint>" 
    
  3. Perform a health check to ensure that the service is available. If everything is working properly, the check returns a response of healthy. A scripted variant of this check is shown after this list.

    curl -X "GET" "${EVALUATOR_SERVICE_URL}/health" \
    -H 'accept: application/json'
    
    endpoint = f"{EVALUATOR_SERVICE_URL}/health"
    response = requests.get(endpoint).json()
    response
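
If you run these steps from a script rather than an interactive shell or notebook, you can read the service URL from the environment variable you exported above and fail fast when the service is unreachable. The following is a minimal sketch; the exact body of the health response depends on your deployment.

import os
import requests

# Read the URL exported in the previous step instead of hardcoding it.
EVALUATOR_SERVICE_URL = os.environ["EVALUATOR_SERVICE_URL"]

# Fail fast if the Evaluator service is unreachable or returns an error status.
response = requests.get(f"{EVALUATOR_SERVICE_URL}/health", headers={"accept": "application/json"})
response.raise_for_status()
print(response.json())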
    

1. Create a Target#

First, create an evaluation target that meets the LLM Model requirements. Each target is uniquely identified by a combination of namespace and name. For example, my-organization/my-target.

Use the following code to create your target. Make sure that you specify the actual NIM service URL in the request body (don’t use ${NIM_SERVICE_URL}, as it doesn’t automatically resolve).

curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/targets" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "model",
        "name": "my-model-target-1",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                "url": "${NIM_SERVICE_URL}/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct",
                "format": "openai"
            }
        }
    }'
data = {
   "type": "model",
   "name": "my-model-target-1",
   "namespace": "my-organization",
   "model": {
      "api_endpoint": {
         "url": "${NIM_SERVICE_URL}/v1/chat/completions",
         "model_id": "meta/llama-3.1-8b-instruct",
         "format": "openai"
      }
   }
}

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/targets"

response = requests.post(endpoint, json=data).json()
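
As noted above, the request body must contain the resolved NIM service URL rather than the literal ${NIM_SERVICE_URL} placeholder. If you keep the NIM URL in an environment variable, the following sketch shows one way to build and send the request body in Python; it assumes you exported NIM_SERVICE_URL in the same shell session.

import os
import requests

# Resolve the NIM service URL from the environment so the literal placeholder never reaches the API.
nim_service_url = os.environ["NIM_SERVICE_URL"]

data = {
    "type": "model",
    "name": "my-model-target-1",
    "namespace": "my-organization",
    "model": {
        "api_endpoint": {
            "url": f"{nim_service_url}/v1/chat/completions",  # resolved URL, not "${NIM_SERVICE_URL}"
            "model_id": "meta/llama-3.1-8b-instruct",
            "format": "openai"
        }
    }
}

response = requests.post(f"{EVALUATOR_SERVICE_URL}/v1/evaluation/targets", json=data)
response.raise_for_status()  # fail fast if the target was not created
print(response.json())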

Note

Chat Templates

When you use chat templates for evaluation (apply_chat_template: true in the evaluation config, as shown in the example below), do the following:

  1. Use the chat completions endpoint (/v1/chat/completions) in the target configuration for the API endpoint URL of the model.

  2. Set format: "openai" in the target configuration for the model API endpoint.

To see a sample response, refer to Create Evaluation Target.

Tip

If you want to evaluate a model’s ability to show step-by-step reasoning, make sure the model was fine-tuned or configured with detailed thinking on in the system message. For direct answers, use detailed thinking off. Refer to Reasoning Considerations for details on how to set this during dataset preparation.

Tip

To pass a system prompt (such as enabling detailed reasoning) to Llama Nemotron models during evaluation, set the system_instruction field in your configuration’s params.extra section. Refer to Passing System Prompts for details and examples.
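
As a rough illustration of the preceding Tip, the following partial sketch shows where a system_instruction field sits inside params.extra when you create a configuration. The prompt text ("detailed thinking on") and the configuration name are assumptions for illustration only; refer to Passing System Prompts for the authoritative examples.

# Partial sketch only: illustrates the placement of `system_instruction` inside `params.extra`.
data = {
    "type": "gsm8k",
    "name": "my-gsm8k-config-reasoning",  # hypothetical name
    "namespace": "my-organization",
    "params": {
        "extra": {
            "system_instruction": "detailed thinking on"  # assumed prompt text; model-specific
        }
    }
}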

2. Create a Configuration#

Next, you create an evaluation configuration. For this example, you create an LM Evaluation Harness configuration.

Important

When using chat templates (apply_chat_template: true), you must also configure the HuggingFace tokenizer:

  1. Set tokenizer_backend: "hf" to use the HuggingFace tokenizer.

  2. Set tokenizer to the appropriate HuggingFace model ID (for example, meta-llama/Llama-3.1-8B-Instruct).

  3. Set hf_token to your HuggingFace access token (required for most model tokenizers).

Use the following code to create your configuration.

Important

Each configuration is uniquely identified by a combination of namespace and name. For example, my-organization/my-configuration.

curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gsm8k",
        "name": "my-gsm8k-config-1",
        "namespace": "my-organization",
        "params": {
            "temperature": 0.00001,      
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "extra": {
                "num_fewshot": 8,
                "batch_size": 16,
                "bootstrap_iters": 100000,
                "dataset_seed": 42,
                "use_greedy": true,
                "top_k": 1,
                "apply_chat_template": true,
                "fewshot_as_multiturn": true,
                "hf_token": "<your-hf-token>",
                "tokenizer_backend": "hf",
                "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
            }
        }
    }'
data = {
    "type": "gsm8k",
    "name": "my-gsm8k-config-1",
    "namespace": "my-organization",
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "apply_chat_template": True,
            "fewshot_as_multiturn": True,
            "hf_token": "<your-hf-token>",
            "tokenizer_backend": "hf", 
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
        }
    }
}

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

3. Create and Launch a Job#

Finally, you create an evaluation job that uses the target and configuration that you created in the previous steps. When you call the API to create the job, the evaluation starts.

Important

Each job is uniquely identified by a job_id that the Evaluator service creates. For example, eval-1234ABCD5678EFGH.

Use the following code to create and launch your job.

curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "namespace": "my-organization",
        "target": "my-organization/my-model-target-1",
        "config": "my-organization/my-gsm8k-config-1"
    }'
data = {
   "namespace": "my-organization",
   "target": "my-organization/my-model-target-1",
   "config": "my-organization/my-gsm8k-config-1"
}

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs"

response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later

job_id = response['id']
print(f"Job ID: {job_id}")

# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.

job_status = response['status']
print(f"Job status: {job_status}")

To see a sample response and more information about the response format, refer to API.

4. Get the Status of Your Evaluation Job#

To get the status of the evaluation job that you submitted in the previous step, use the following code. In the curl command, replace <job-id> with the job ID that you saved in the previous step.

Note

For non-custom evaluations like LM Evaluation Harness, the progress percentage in the API may not update regularly. However, as long as the status is running, the evaluation is proceeding. The time to complete depends on factors like model performance and dataset size.

curl -X "GET" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>/status" \
  -H 'accept: application/json'
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>/status"
response = requests.get(endpoint).json()
response

You receive a response similar to the following, which contains the status of each task and the percentage of progress completed. For more information, refer to Get Evaluation Job Status.

{
  "message": "completed",
  "task_status": {
  },
  "progress": 100
}
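
If you script the evaluation end to end, you can poll the status endpoint until the job reaches a terminal state. The following is a minimal sketch; the terminal values checked here ("completed", "failed", "cancelled") and the polling interval are assumptions that you may need to adjust for your deployment.

import time
import requests

endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/{job_id}/status"

while True:
    status = requests.get(endpoint, headers={"accept": "application/json"}).json()
    message = status.get("message", "")
    print(f"Status: {message}, progress: {status.get('progress')}")
    if message in ("completed", "failed", "cancelled"):  # assumed terminal states
        break
    time.sleep(30)  # poll every 30 seconds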

Next Steps#