Run an Academic LM Harness Eval#

Learn how to perform a simple academic benchmark evaluation by running the GSM8K task from the LM Evaluation Harness against a deployed model.

Note

This tutorial takes approximately 15 minutes to complete and walks you through running a single evaluation job. For more information on evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

  1. Set up Evaluator with Docker Compose and deploy meta/llama-3.2-3b-instruct.

  2. This tutorial uses a model tokenizer from Hugging Face, which requires a read-only Hugging Face access token and acceptance of the model's license agreement on Hugging Face.

  3. Store your service URLs as variables so you can use them in your code.

    • Base URL is the main endpoint for interacting with the NeMo Microservices Platform.

    • Inference base URL is the inference endpoint for models.

    • NeMo Data Store URL is the service address for dataset storage and exposes a Hugging-Face-compatible API.

    from nemo_microservices import NeMoMicroservices
    
    # Set variables and initialize the client
    client = NeMoMicroservices(
        base_url="http://localhost:8080",
        inference_base_url="http://localhost:8000",
    )
    hf_token = "<your readonly Hugging Face token>"
    
    export NEMO_MICROSERVICES_BASE_URL="http://localhost:8080"
    export NIM_BASE_URL="http://localhost:8000"
    export HF_TOKEN="<your readonly Hugging Face token>"
    

    Important

    Reuse this client in the steps that follow.

    Update the URLs to match your deployment. A sketch that reads these values from environment variables appears after the prerequisites.

  4. Verify service URLs before starting the tutorial.

    jobs = client.v2.evaluation.jobs.list()
    print(jobs)
    
    import requests

    # Query the inference endpoint for the models it serves
    resp = requests.get(f"{client.inference_base_url}/v1/models")
    print(resp.json())
    
    curl ${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs
    {"object":"list","data":[],"pagination":{}}
    
    curl ${NIM_BASE_URL}/v1/models
    {
      "object": "list",
      "data": [{
        "id": "meta/llama-3.2-3b-instruct",
        "object": "model",
        "created": 1760457209,
        "owned_by": "system",
        "root": "meta/llama-3.2-3b-instruct",
        "parent": null,
        "max_model_len": 131072,
        "permission": []
      }]
    }
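
If you prefer not to hard-code URLs, the client can read the same environment variables used by the curl examples. The following is a minimal sketch based on the constructor shown above; the fallback values are assumptions for a local Docker Compose deployment.

import os

from nemo_microservices import NeMoMicroservices

# Initialize the client from the environment variables exported above,
# falling back to local Docker Compose defaults.
client = NeMoMicroservices(
    base_url=os.environ.get("NEMO_MICROSERVICES_BASE_URL", "http://localhost:8080"),
    inference_base_url=os.environ.get("NIM_BASE_URL", "http://localhost:8000"),
)
hf_token = os.environ.get("HF_TOKEN", "<your readonly Hugging Face token>")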
    

1. Create a job#

The following example creates an evaluation job made up of two components:

  • target – the model under evaluation

  • config – specifies the evaluation to perform and its parameters

When you call the API to create the job, the evaluation starts automatically.

Configure the model and tokenizer.

model = {
  "api_endpoint": {
    "url": f"{client.inference_base_url}/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct"
  }
}
tokenizer = "meta-llama/Llama-3.2-3B-Instruct"

If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the model by ID.

model = "meta/llama-3.1-8b-instruct"
tokenizer = "meta-llama/Llama-3.1-8B-Instruct"

If you use a hosted model such as one from build.nvidia.com, configure the model with the URL and API key for the hosted model.

model = {
  "api_endpoint": {
    "url": "https://integrate.api.nvidia.com/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct",
    "api_key": "<your build.nvidia.com API key>"
  }
}
tokenizer = "meta-llama/Llama-3.2-3B-Instruct"
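
Optionally, before creating the job, you can confirm that your Hugging Face token can load the tokenizer. This is a sanity-check sketch that uses the transformers library (not part of the Evaluator API) and assumes transformers is installed in your environment.

# Optional sanity check: confirm the tokenizer can be downloaded with your token.
# The `token` argument is supported by recent transformers versions.
from transformers import AutoTokenizer

AutoTokenizer.from_pretrained(tokenizer, token=hf_token)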

Configure and create your job.

job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "model": model
        },
        "config": {
            "type": "gsm8k",
            "params": {
                "temperature": 0.00001,      
                "top_p": 0.00001,
                "max_tokens": 256,
                "stop": ["<|eot|>"],
                "parallelism": 5,
                "max_retries": 10,
                "request_timeout": 30,
                "limit_samples": 10,
                "extra": {
                    "num_fewshot": 8,
                    "batch_size": 16,
                    "bootstrap_iters": 100000,
                    "dataset_seed": 42,
                    "use_greedy": True,
                    "top_k": 1,
                    "apply_chat_template": True,
                    "fewshot_as_multiturn": True,
                    "hf_token": HF_TOKEN,
                    "tokenizer_backend": "hf",
                    "tokenizer": tokenizer
                }
            }
        }
    }
)

Jobs are uniquely identified by an id. You can view additional details of the evaluation job, such as its status:

print(f"Job ID: {job.id}")
print(f"Project: {job.project}")
print(f"Job Status: {job.status}")

This should print out something like:

Job ID: job-dq1pjj6vj5p64xaeqgvuk4
Project: my-project
Job Status: created

Configure the model.

export MODEL='{
  "api_endpoint": {
    "url": "'${NIM_BASE_URL}'/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct"
  }
}'
export MODEL_TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"

If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the model by ID.

export MODEL='"meta/llama-3.1-8b-instruct"'
export MODEL_TOKENIZER="meta-llama/Llama-3.1-8B-Instruct"

If you use a hosted model such as one from build.nvidia.com, configure the model with the URL and API key for the hosted model.

export MODEL='{
  "api_endpoint": {
    "url": "https://integrate.api.nvidia.com/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct",
    "api_key": "<your build.nvidia.com API key>"
  }
}'
export MODEL_TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"
curl -X "POST" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "spec": {
            "target": {
                "type": "model",
                "model": '${MODEL}'
            },
            "config": {
                "type": "gsm8k",
                "params": {
                    "temperature": 0.00001,
                    "top_p": 0.00001,
                    "max_tokens": 256,
                    "stop": ["<|eot|>"],
                    "parallelism": 5,
                    "max_retries": 10,
                    "request_timeout": 30,
                    "limit_samples": 10,
                    "extra": {
                        "num_fewshot": 8,
                        "batch_size": 16,
                        "bootstrap_iters": 100000,
                        "dataset_seed": 42,
                        "use_greedy": true,
                        "top_k": 1,
                        "apply_chat_template": true,
                        "fewshot_as_multiturn": true,
                        "hf_token": "'${HF_TOKEN}'",
                        "tokenizer_backend": "hf",
                        "tokenizer": "'${MODEL_TOKENIZER}'"
                    }
                }
            }
        }
    }'

Jobs are uniquely identified by an id, which you can find in the JSON response. It can be helpful to store the job ID in a variable for use in subsequent steps:

export JOB_ID=<id returned in the JSON response>

echo "Job ID: $JOB_ID"

For more information about the response format, refer to v2 (Preview).

Note

Chat Templates

When using chat templates for evaluation (apply_chat_template: true is set in the evaluation config), make sure that you:

  1. Use the correct endpoint (/v1/completions or /v1/chat/completions) as the model's API endpoint URL in the target configuration.

  2. Set format: "openai" in the target configuration for the model API endpoint.

  3. Configure the Hugging Face tokenizer:

    1. Set tokenizer_backend: "hf" to use the Hugging Face tokenizer

    2. Set tokenizer to the appropriate Hugging Face model ID (for example, "meta-llama/Llama-3.2-3B-Instruct")

    3. Set hf_token to your Hugging Face access token (required for most model tokenizers)

    4. Prerequisite: Ensure that you've accepted the license agreement for the model on Hugging Face and that access has been granted. Otherwise, your job may fail with authorization errors from Hugging Face.

For more details on LM Eval configurations, see LM Evaluation Harness.
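
For illustration, here is a minimal sketch that combines the chat-template settings above with the endpoint and model used in this tutorial. The exact placement of format is an assumption based on the note above; verify it against your Evaluator version.

# Sketch only: a chat-template target and the related config params for this tutorial's model.
target = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": f"{client.inference_base_url}/v1/chat/completions",  # chat endpoint
            "model_id": "meta/llama-3.2-3b-instruct",
            "format": "openai"  # assumed placement, per the note above
        }
    }
}
chat_template_params = {
    "apply_chat_template": True,
    "fewshot_as_multiturn": True,
    "tokenizer_backend": "hf",
    "tokenizer": "meta-llama/Llama-3.2-3B-Instruct",
    "hf_token": hf_token  # read-only token from the prerequisites
}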

2. Get the Status of Your Evaluation Job#

To get the status of the evaluation job that you submitted in the previous step, use the following code. The evaluation should take about 5 minutes to complete.

job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(f"Job status: {job_status}")
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
  -H 'accept: application/json'

You receive a response similar to the following, which contains the status of each task, and the percentage of progress already completed. For more information, refer to Get Evaluation Job Status.

{
  "job_id": "job-dq1pjj6vj5p64xaeqgvuk4",
  "status": "active",
  "status_details": {},
  "error_details": null,
  "steps": [
    {
      "name": "evaluation",
      "status": "created",
      "status_details": {},
      "error_details": {},
      "tasks": []
    }
  ]
}

Monitor the job until it completes.

import time

# Poll until the job leaves its in-progress states
job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
while job_status.status in ("active", "pending", "created"):
    time.sleep(1)
    job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(job_status)
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
  -H 'accept: application/json'
{
  "job_id": "job-dq1pjj6vj5p64xaeqgvuk4",
  "status": "completed",
  "status_details": {
    "samples_processed": 10,
    "progress": 100
  },
  "error_details": null,
  "steps": [
    {
      "name": "results",
      "status": "completed"
    },
    {
      "name": "evaluation",
      "status": "completed"
    }
  ]
}

Note

For academic benchmarks, progress is tracked only through samples_processed. The progress percentage will be either 0 (when the evaluation is running) or 100 (when the evaluation has succeeded).
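
If you prefer the polling loop to give up after a while instead of waiting indefinitely, a bounded variant might look like the following. The 10-minute deadline and 10-second interval are arbitrary choices, not requirements of the API.

import time

# Poll the job status with an upper bound on the total wait time.
deadline = time.time() + 10 * 60  # assumed 10-minute limit
job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
while job_status.status in ("active", "pending", "created") and time.time() < deadline:
    time.sleep(10)
    job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(f"Final status: {job_status.status}")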

3. View Evaluation Job Results#

Once the job has completed successfully, you can examine the results of the evaluation:

# v2 - Get structured evaluation results
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job.id)
print(results.model_dump_json(indent=2, exclude_none=True))

# v2 - List available result types
available_results = client.v2.evaluation.jobs.results.list(job.id)
print(f"Available results: {[r.result_name for r in available_results.data]}")
# Get structured evaluation results
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
  -H 'accept: application/json'

# List available result types
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results" \
  -H 'accept: application/json'
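
If you want to keep a copy of the results, you can write the structured output to a local file. This is a convenience sketch that reuses the retrieve call and model_dump_json shown above; the file name is arbitrary.

from pathlib import Path

# Persist the structured evaluation results for later inspection.
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job.id)
Path("gsm8k_results.json").write_text(results.model_dump_json(indent=2, exclude_none=True))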

4. Cleanup#

To delete the job created in this tutorial, execute the following:

# v2 - Delete the evaluation job
client.v2.evaluation.jobs.delete(job.id)
print("Job deleted successfully")
curl -X "DELETE" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}" \
  -H 'accept: application/json'
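
To confirm that the job was removed, you can list evaluation jobs again with the same call used in the prerequisites; the deleted job should no longer appear.

# Sketch: list jobs again and confirm the deleted job is gone.
jobs = client.v2.evaluation.jobs.list()
print(jobs)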

Next Steps#