Run a Simple Evaluation#

This tutorial helps you get started with NVIDIA NeMo Evaluator by walking you through a simple LM Evaluation Harness evaluation that targets an LLM model.

Note

This tutorial takes approximately 25 minutes to complete and includes running an evaluation job. For more information about evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

Before you begin this tutorial, do the following:

  1. Install NeMo Evaluator. For more information, see NeMo Evaluator Deployment Guide.

  2. The URL of the Evaluator API depends on where you deploy the Evaluator service and how you configure it. Store the Evaluator hostname so that you can use it in your code.

    Important

    Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.

    export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
    
    import requests
    
    EVALUATOR_HOSTNAME = "<your evaluator service endpoint>" 
    
  3. Perform a health check to ensure that the service is available. If everything is working properly, the endpoint reports that the service is healthy (a fail-fast variant of this check follows the code below).

    curl -X "GET" "http://${EVALUATOR_HOSTNAME}/health" \
    -H 'accept: application/json'
    
    endpoint = f"http://{EVALUATOR_HOSTNAME}/health"
    response = requests.get(endpoint).json()
    response
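
    If you prefer to fail fast when the service is unreachable, a minimal check such as the following sketch can wrap the same request; adjust it to match the exact response body that your deployment returns.

    # Minimal sketch: stop early if the health endpoint is unreachable or
    # returns an error status code; the body is printed for inspection.
    health = requests.get(f"http://{EVALUATOR_HOSTNAME}/health", timeout=10)
    health.raise_for_status()
    print(health.json())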
    

Step 1: Create a Target#

First, you create an evaluation target. For this example, you create an LLM Model target.

Use the following code to create your target. Set the {NIM_HOSTNAME} value according to one of the following options (a short sketch after the list shows one way to make the substitution):

  • If you have a deployed NIM, set {NIM_HOSTNAME} to the NIM endpoint URL.

  • If your cluster administrator installed the NeMo Microservices Helm Chart, you can use the NIM Proxy API endpoint URL http://nemo-nim-proxy:8000.

  • If your cluster administrator installed NeMo Evaluator and NIM Proxy individually, use the service URL of the NIM Proxy microservice if it is installed in the same cluster, or its fully qualified domain name (FQDN) if it is installed outside the cluster.
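
The following sketch shows one way to perform the substitution in Python before building the request body. The value used here is the NIM Proxy endpoint from the Helm chart option above; replace it with whichever option applies to your deployment.

# Sketch: pick the hostname that matches your deployment. The value below is
# the NIM Proxy endpoint from the Helm chart option; adjust it as needed.
NIM_HOSTNAME = "nemo-nim-proxy:8000"
target_api_url = f"http://{NIM_HOSTNAME}/v1/completions"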

Important

Each target is uniquely identified by a combination of namespace and name, for example my-organization/my-target.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/targets" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "model",
        "name": "my-model-target-1",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                "url": "http://{NIM_HOSTNAME}/v1/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        }
    }'
data = {
   "type": "model",
   "name": "my-model-target-1",
   "namespace": "my-organization",
   "model": {
      "api_endpoint": {
         "url": "http://{NIM_HOSTNAME}/v1/completions",
         "model_id": "meta/llama-3.1-8b-instruct"
      }
   }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/targets"

response = requests.post(endpoint, json=data).json()

To see a sample response, refer to Create Target Response.
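
If you want to confirm that the target was stored, you can read it back. The retrieval path in this sketch is an assumption that mirrors the creation route with the namespace and name appended; verify it against the Evaluator API reference for your release.

# Assumption: targets can be read back at /v1/evaluation/targets/{namespace}/{name}.
target_endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/targets/my-organization/my-model-target-1"
print(requests.get(target_endpoint).json())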

Step 2: Create a Configuration#

Next, you create an evaluation configuration. For this example, you create an LM Evaluation Harness configuration.

Use the following code to create your configuration.

Important

Each configuration is uniquely identified by a combination of namespace and name, for example my-organization/my-configuration.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gsm8k",
        "name": "my-gsm8k-config-1",
        "namespace": "my-organization",
        "params": {
            "temperature": 0.00001,      
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "extra": {
                "num_fewshot": 8,
                "batch_size": 16,
                "bootstrap_iters": 100000,
                "dataset_seed": 42,
                "use_greedy": true,
                "top_k": 1,
                "apply_chat_template": true,
                "fewshot_as_multiturn": true
            }
        }
    }'
data = {
    "type": "gsm8k",
    "name": "my-gsm8k-config-1",
    "namespace": "my-organization",
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "apply_chat_template": True,
            "fewshot_as_multiturn": True
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

To see a sample response, refer to Create Config Response.
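
As with the target, you can read the configuration back to confirm that it exists. The path in this sketch is likewise an assumption that appends the namespace and name to the creation route; verify it against the Evaluator API reference.

# Assumption: configs can be read back at /v1/evaluation/configs/{namespace}/{name}.
config_endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs/my-organization/my-gsm8k-config-1"
print(requests.get(config_endpoint).json())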

Step 3: Create and Launch a Job#

Finally, you create an evaluation job that uses the target and configuration that you created in the previous steps. When you call the API to create the job, the evaluation starts.

Important

Each job is uniquely identified by a job_id that the Evaluator service creates, for example eval-1234ABCD5678EFGH.

Use the following code to create and launch your job.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "namespace": "my-organization",
        "target": "my-organization/my-model-target-1",
        "config": "my-organization/my-gsm8k-config-1"
    }'
data = {
   "namespace": "my-organization",
   "target": "my-organization/my-model-target-1",
   "config": "my-organization/my-gsm8k-config-1"
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later

job_id = response['id']
print(f"Job ID: {job_id}")

# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.

job_status = response['status']
print(f"Job status: {job_status}")

To see a sample response, refer to Create Job Response.

Step 4: Get the Status of Your Evaluation Job#

To get the status of the evaluation job that you submitted in the previous step, use the following code. In the curl command, replace <job-id> with the job ID returned in Step 3; the Python example reuses the job_id variable that you captured there.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status"
response = requests.get(endpoint).json()
response

You receive a response similar to the following, which contains the status of each task and the percentage of progress completed. For more information, refer to Get the Status of an Evaluation Job.

{
  "message": "completed",
  "task_status": {
  },
  "progress": 100
}
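
Because an evaluation job can run for several minutes, you may prefer to poll the status endpoint until the job reaches a terminal state instead of checking it by hand. The following sketch reuses the job_id captured in Step 3; the "failed" terminal message is an assumption, so adjust the stop condition to match the messages your deployment reports.

import time

# Sketch: poll the status endpoint every 30 seconds until the job finishes.
# "completed" matches the sample response above; "failed" is an assumed
# terminal message and may differ in your deployment.
status_endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/{job_id}/status"

while True:
    status = requests.get(status_endpoint).json()
    print(f"progress: {status.get('progress')}%, message: {status.get('message')}")
    if status.get("message") in ("completed", "failed"):
        break
    time.sleep(30)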

Next Steps#