Run an Academic LM Harness Eval#

Learn how to perform a simple academic benchmark evaluation by running the GSM8K task from the LM Evaluation Harness against a deployed model.

Note

This tutorial takes approximately 15 minutes to complete and walks you through running a single evaluation job. For more information on evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

  1. Set up Evaluator with Docker Compose and deploy meta/llama-3.2-3b-instruct.

  2. This tutorial uses a model tokenizer from Hugging Face, which requires a read-only Hugging Face access token and acceptance of the model's license agreement on Hugging Face.

  3. Store your service URLs as variables so you can use them in your code.

    • Base URL is the main endpoint for interacting with the NeMo Microservices Platform.

    • Inference base URL is the inference endpoint for models.

    • NeMo Data Store URL is the service address for dataset storage and exposes a Hugging-Face-compatible API.

    from nemo_microservices import NeMoMicroservices
    
    # Set variables and initialize the client
    client = NeMoMicroservices(
        base_url="http://localhost:8080",
        inference_base_url="http://localhost:8000",
    )
    hf_token = "<your readonly Hugging Face token>"
    
    export NEMO_MICROSERVICES_BASE_URL="http://localhost:8080"
    export NIM_BASE_URL="http://localhost:8000"
    export HF_TOKEN="<your readonly Hugging Face token>"
    

    Important

    Reuse this client in the steps that follow.

    Update the URLs to match your deployment. A sketch that reads these values from environment variables appears after the prerequisites.

  4. Verify service URLs before starting the tutorial.

    jobs = client.v2.evaluation.jobs.list()
    print(jobs)
    
    import requests

    # Query the inference endpoint for the models it serves
    resp = requests.get(f"{client.inference_base_url}/v1/models")
    print(resp.json())
    
    curl ${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs
    {"object":"list","data":[],"pagination":{}}
    
    curl ${NIM_BASE_URL}/v1/models
    {
      "object": "list",
      "data": [{
        "id": "meta/llama-3.2-3b-instruct",
        "object": "model",
        "created": 1760457209,
        "owned_by": "system",
        "root": "meta/llama-3.2-3b-instruct",
        "parent": null,
        "max_model_len": 131072,
        "permission": []
      }]
    }
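
If you prefer not to hard-code URLs, the client can read the same environment variables used by the curl examples. The following is a minimal sketch based on the constructor shown above; the fallback values are assumptions for a local Docker Compose deployment.

import os

from nemo_microservices import NeMoMicroservices

# Initialize the client from the environment variables exported above,
# falling back to local Docker Compose defaults.
client = NeMoMicroservices(
    base_url=os.environ.get("NEMO_MICROSERVICES_BASE_URL", "http://localhost:8080"),
    inference_base_url=os.environ.get("NIM_BASE_URL", "http://localhost:8000"),
)
hf_token = os.environ.get("HF_TOKEN", "<your readonly Hugging Face token>")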
    

1. Create a job#

The following example creates an evaluation job made up of two components:

  • target – the model under evaluation

  • config – specifies the evaluation to perform and its parameters

When you call the API to create the job, the evaluation starts automatically.

Configure the model and tokenizer.

model = {
  "api_endpoint": {
    "url": f"{client.inference_base_url}/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct"
  }
}
tokenizer = "meta-llama/Llama-3.2-3B-Instruct"

If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the model by ID.

model = "meta/llama-3.1-8b-instruct"
tokenizer = "meta-llama/Llama-3.1-8B-Instruct"

If you use a hosted model such as one from build.nvidia.com, configure the model with the URL and API key for the hosted model.

model = {
  "api_endpoint": {
    "url": "https://integrate.api.nvidia.com/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct",
    "api_key": "<your build.nvidia.com API key>"
  }
}
tokenizer = "meta-llama/Llama-3.2-3B-Instruct"
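
Optionally, before creating the job, you can confirm that your Hugging Face token can load the tokenizer. This is a sanity-check sketch that uses the transformers library (not part of the Evaluator API) and assumes transformers is installed in your environment.

# Optional sanity check: confirm the tokenizer can be downloaded with your token.
# The `token` argument is supported by recent transformers versions.
from transformers import AutoTokenizer

AutoTokenizer.from_pretrained(tokenizer, token=hf_token)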

Configure and create your job.

job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "model": model
        },
        "config": {
            "type": "gsm8k",
            "params": {
                "temperature": 0.00001,      
                "top_p": 0.00001,
                "max_tokens": 256,
                "stop": ["<|eot|>"],
                "parallelism": 5,
                "max_retries": 10,
                "request_timeout": 30,
                "limit_samples": 10,
                "extra": {
                    "num_fewshot": 8,
                    "batch_size": 16,
                    "bootstrap_iters": 100000,
                    "dataset_seed": 42,
                    "use_greedy": True,
                    "top_k": 1,
                    "apply_chat_template": True,
                    "fewshot_as_multiturn": True,
                    "hf_token": HF_TOKEN,
                    "tokenizer_backend": "hf",
                    "tokenizer": tokenizer
                }
            }
        }
    }
)

Jobs are uniquely identified by an id. You can view additional details of the evaluation job, such as its status:

print(f"Job ID: {job.id}")
print(f"Project: {job.project}")
print(f"Job Status: {job.status}")

This should print out something like:

Job ID: job-dq1pjj6vj5p64xaeqgvuk4
Project: my-project
Job Status: created

Configure the model.

export MODEL='{
  "api_endpoint": {
    "url": "'${NIM_BASE_URL}'/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct"
  }
}'
export MODEL_TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"

If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the model by ID.

export MODEL='"meta/llama-3.1-8b-instruct"'
export MODEL_TOKENIZER="meta-llama/Llama-3.1-8B-Instruct"

If you use a hosted model such as one from build.nvidia.com, configure the model with the URL and API key for the hosted model.

export MODEL='{
  "api_endpoint": {
    "url": "https://integrate.api.nvidia.com/v1/completions",
    "model_id": "meta/llama-3.2-3b-instruct",
    "api_key": "<your build.nvidia.com API key>"
  }
}'
export MODEL_TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"
curl -X "POST" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "spec": {
            "target": {
                "type": "model",
                "model": '${MODEL}'
            },
            "config": {
                "type": "gsm8k",
                "params": {
                    "temperature": 0.00001,
                    "top_p": 0.00001,
                    "max_tokens": 256,
                    "stop": ["<|eot|>"],
                    "parallelism": 5,
                    "max_retries": 10,
                    "request_timeout": 30,
                    "limit_samples": 10,
                    "extra": {
                        "num_fewshot": 8,
                        "batch_size": 16,
                        "bootstrap_iters": 100000,
                        "dataset_seed": 42,
                        "use_greedy": true,
                        "top_k": 1,
                        "apply_chat_template": true,
                        "fewshot_as_multiturn": true,
                        "hf_token": "'${HF_TOKEN}'",
                        "tokenizer_backend": "hf",
                        "tokenizer": "'${MODEL_TOKENIZER}'"
                    }
                }
            }
        }
    }'

Jobs are uniquely identified by an id, which you can find in the JSON response. It can be helpful to store the job ID in a variable for use in subsequent steps:

export JOB_ID=<id returned in the JSON response>

echo "Job ID: $JOB_ID"

For more information about the response format, refer to v2 (Preview).

Note

Chat Templates

When using chat templates for evaluation (apply_chat_template: true is set in the evaluation config), make sure that you:

  1. Use the correct endpoint (/v1/completions or /v1/chat/completions) as the model's API endpoint URL in the target configuration.

  2. Set format: "openai" in the target configuration for the model API endpoint.

  3. Configure the Hugging Face tokenizer:

    1. Set tokenizer_backend: "hf" to use the Hugging Face tokenizer

    2. Set tokenizer to the appropriate Hugging Face model ID (for example, "meta-llama/Llama-3.2-3B-Instruct")

    3. Set hf_token to your Hugging Face access token (required for most model tokenizers)

    4. Prerequisite: Ensure that you've accepted the license agreement for the model on Hugging Face and that access has been granted. Otherwise, your job may fail with authorization errors from Hugging Face.

For more details on LM Eval configurations, see LM Evaluation Harness.
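
For illustration, here is a minimal sketch that combines the chat-template settings above with the endpoint and model used in this tutorial. The exact placement of format is an assumption based on the note above; verify it against your Evaluator version.

# Sketch only: a chat-template target and the related config params for this tutorial's model.
target = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": f"{client.inference_base_url}/v1/chat/completions",  # chat endpoint
            "model_id": "meta/llama-3.2-3b-instruct",
            "format": "openai"  # assumed placement, per the note above
        }
    }
}
chat_template_params = {
    "apply_chat_template": True,
    "fewshot_as_multiturn": True,
    "tokenizer_backend": "hf",
    "tokenizer": "meta-llama/Llama-3.2-3B-Instruct",
    "hf_token": hf_token  # read-only token from the prerequisites
}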

2. Get the Status of Your Evaluation Job#

To get the status of the evaluation job that you submitted in the previous step, use the following code. The evaluation should take about 5 minutes to complete.

job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(f"Job status: {job_status}")
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
  -H 'accept: application/json'

You receive a response similar to the following, which contains the status of each task, and the percentage of progress already completed. For more information, refer to Get Evaluation Job Status.

{
  "job_id": "job-dq1pjj6vj5p64xaeqgvuk4",
  "status": "active",
  "status_details": {},
  "error_details": null,
  "steps": [
    {
      "name": "evaluation",
      "status": "created",
      "status_details": {},
      "error_details": {},
      "tasks": []
    }
  ]
}

Monitor the job until it completes.

import time

# Poll until the job leaves its in-progress states
job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
while job_status.status in ("active", "pending", "created"):
    time.sleep(1)
    job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(job_status)
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
  -H 'accept: application/json'
{
  "job_id": "job-dq1pjj6vj5p64xaeqgvuk4",
  "status": "completed",
  "status_details": {
    "samples_processed": 10,
    "progress": 100
  },
  "error_details": null,
  "steps": [
    {
      "name": "results",
      "status": "completed"
    },
    {
      "name": "evaluation",
      "status": "completed"
    }
  ]
}

Note

For academic benchmarks, progress is tracked only through samples_processed. The progress percentage will be either 0 (when the evaluation is running) or 100 (when the evaluation has succeeded).
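
If you prefer the polling loop to give up after a while instead of waiting indefinitely, a bounded variant might look like the following. The 10-minute deadline and 10-second interval are arbitrary choices, not requirements of the API.

import time

# Poll the job status with an upper bound on the total wait time.
deadline = time.time() + 10 * 60  # assumed 10-minute limit
job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
while job_status.status in ("active", "pending", "created") and time.time() < deadline:
    time.sleep(10)
    job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(f"Final status: {job_status.status}")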

3. View Evaluation Job Results#

Once the job has completed successfully, you can examine the results of the evaluation:

# v2 - Get structured evaluation results
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job.id)
print(results.model_dump_json(indent=2, exclude_none=True))

# v2 - List available result types
available_results = client.v2.evaluation.jobs.results.list(job.id)
print(f"Available results: {[r.result_name for r in available_results.data]}")
# Get structured evaluation results
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
  -H 'accept: application/json'

# List available result types
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results" \
  -H 'accept: application/json'
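
If you want to keep a copy of the results, you can write the structured output to a local file. This is a convenience sketch that reuses the retrieve call and model_dump_json shown above; the file name is arbitrary.

from pathlib import Path

# Persist the structured evaluation results for later inspection.
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job.id)
Path("gsm8k_results.json").write_text(results.model_dump_json(indent=2, exclude_none=True))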

4. Cleanup#

To delete the job created in this tutorial, execute the following:

# v2 - Delete the evaluation job
client.v2.evaluation.jobs.delete(job.id)
print("Job deleted successfully")
curl -X "DELETE" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}" \
  -H 'accept: application/json'
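
To confirm that the job was removed, you can list evaluation jobs again with the same call used in the prerequisites; the deleted job should no longer appear.

# Sketch: list jobs again and confirm the deleted job is gone.
jobs = client.v2.evaluation.jobs.list()
print(jobs)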

Next Steps#