Run an Academic LM Harness Eval#

Learn how to perform a simple academic evaluation with the LM Evaluation Harness (the GSM8K task) against a model served by NIM.

Note

This tutorial takes approximately 15 minutes to complete, during which you run an evaluation job. For more information on evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

  1. Set up Evaluator before following this tutorial. Refer to the Demo Cluster Setup on minikube guide, or to the production deployment guides for the platform and for Evaluator individually.

  2. This tutorial uses a model tokenizer from Hugging Face, which requires a read-only Hugging Face access token and acceptance of the model's license agreement on Hugging Face.

  3. Store the service URL so you can use it in your code. The URL of the Evaluator service depends on where you deploy Evaluator and how you configure it. An optional connectivity check follows the code below.

    Important

    Replace the variable values below with your own. Include the appropriate protocol (http:// or https://) in the full service URL before you run this code.

    from nemo_microservices import NeMoMicroservices
    
    # Set variables and initialize the client
    EVALUATOR_BASE_URL = "http(s)://<your evaluator service endpoint>"
    NIM_BASE_URL = "http(s)://<your NIM service URL>"
    HF_TOKEN = "<your readonly Hugging Face token>"
    
    client = NeMoMicroservices(
        base_url=EVALUATOR_BASE_URL
    )
    
    export EVALUATOR_BASE_URL="http(s)://<your evaluator service endpoint>"
    export NIM_BASE_URL="http(s)://<your NIM service URL>"
    export HF_TOKEN="<your readonly Hugging Face token>"
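
    Optionally, you can confirm that the model endpoint is reachable before you create a job. This is a minimal sketch, assuming the requests package is installed and that your NIM deployment exposes the OpenAI-compatible /v1/models route:

    import requests

    # List the models served by the NIM endpoint as a quick connectivity check
    response = requests.get(f"{NIM_BASE_URL}/v1/models", timeout=10)
    response.raise_for_status()
    print([model["id"] for model in response.json().get("data", [])])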
    

1. Create a Job#

The following example creates an evaluation job within a specified namespace (my-organization). The job is made up of two components:

  • target – the model under evaluation

  • config – specifies the evaluation to perform and its parameters

When you call the API to create the job, the evaluation starts automatically.

Important

When using chat templates (apply_chat_template: true), you must also configure the Hugging Face tokenizer (the relevant settings are shown in isolation after this list):

  1. Set tokenizer_backend: "hf" to use the Hugging Face tokenizer

  2. Set tokenizer to the appropriate Hugging Face model ID (for example, "meta-llama/Llama-3.2-3B-Instruct")

  3. Set hf_token to your Hugging Face access token (required for most model tokenizers)

  4. Ensure you’ve accepted the license agreement for the model on Hugging Face and access has been granted
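
For reference, the settings above map to the following fragment of the config.params.extra block used in the job examples below; the dictionary name is only for illustration:

# Chat-template settings that go inside config.params.extra
chat_template_settings = {
    "apply_chat_template": True,                       # apply the model's chat template
    "tokenizer_backend": "hf",                         # use the Hugging Face tokenizer
    "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",   # Hugging Face model ID of the tokenizer
    "hf_token": HF_TOKEN,                              # read-only Hugging Face access token
}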

v2 (Preview)#

Warning

v2 API Preview: The v2 API is available for testing and feedback but is not yet recommended for production use. Breaking changes may occur before the stable release.

The v2 API introduces a spec envelope at the top level.

job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "model": {
                "api_endpoint": {
                    "url": f"{NIM_BASE_URL}/v1/completions",
                    "model_id": "meta/llama-3.1-8b-instruct",
                    "format": "openai"
                }
            }
        },
        "config": {
            "type": "gsm8k",
            "params": {
                "temperature": 0.00001,      
                "top_p": 0.00001,
                "max_tokens": 256,
                "stop": ["<|eot|>"],
                "parallelism": 5,
                "max_retries": 10,
                "request_timeout": 30,
                "extra": {
                    "num_fewshot": 8,
                    "batch_size": 16,
                    "bootstrap_iters": 100000,
                    "dataset_seed": 42,
                    "use_greedy": True,
                    "top_k": 1,
                    "apply_chat_template": True,
                    "fewshot_as_multiturn": True,
                    "hf_token": HF_TOKEN,
                    "tokenizer_backend": "hf",
                    "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
                }
            }
        }
    }
)

Jobs are uniquely identified by an id. You can view additional details of the evaluation job, such as its status:

print(f"Job ID: {job.id}")
print(f"Project: {job.project}")
print(f"Job Status: {job.status}")

This should print out something like:

Job ID: job-dq1pjj6vj5p64xaeqgvuk4
Project: my-project
Job Status: created
curl -X "POST" "${EVALUATOR_BASE_URL}/v2/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "spec": {
            "target": {
                "type": "model",
                "model": {
                    "api_endpoint": {
                        "url": "'${NIM_BASE_URL}'/v1/completions",
                        "model_id": "meta/llama-3.1-8b-instruct",
                        "format": "openai"
                    }
                }
            },
            "config": {
                "type": "gsm8k",
                "params": {
                    "temperature": 0.00001,
                    "top_p": 0.00001,
                    "max_tokens": 256,
                    "stop": ["<|eot|>"],
                    "parallelism": 5,
                    "max_retries": 10,
                    "request_timeout": 30,
                    "extra": {
                        "num_fewshot": 8,
                        "batch_size": 16,
                        "bootstrap_iters": 100000,
                        "dataset_seed": 42,
                        "use_greedy": true,
                        "top_k": 1,
                        "apply_chat_template": true,
                        "fewshot_as_multiturn": true,
                        "hf_token": "'${HF_TOKEN}'",
                        "tokenizer_backend": "hf",
                        "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
                    }
                }
            }
        }
    }'

The response includes the new v2 fields as well as the job id, which you can capture in a variable for use in the subsequent steps:

export JOB_ID=$(curl -s -X "POST" "${EVALUATOR_BASE_URL}/v2/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{...}' | jq -r '.id')

echo "Job ID: $JOB_ID"

Key v2 Differences:

  • Spec envelope: Target and config are wrapped in a required spec object (see the comparison sketch after this list)

  • Endpoint: Uses /v2/evaluation/jobs instead of /v1/evaluation/jobs

  • Response structure: Includes the new fields and spec envelope in the response

  • Secrets: To securely use API keys for jobs with the v2 API, the secrets must be defined in-line with the job definition, not referenced from v1 targets or configs. Refer to V2 Secrets.
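
To make the envelope change concrete, here is an abbreviated comparison of the two request-body shapes; the placeholder dictionaries stand in for the full target and config objects shown on this page:

# Abbreviated placeholders for the full objects shown in the examples
target = {"type": "model", "model": {"api_endpoint": {"url": "...", "model_id": "...", "format": "openai"}}}
config = {"type": "gsm8k", "params": {"max_tokens": 256}}

# v1: target and config sit at the top level of the request body (POST /v1/evaluation/jobs)
v1_body = {"namespace": "my-organization", "target": target, "config": config}

# v2: the same objects are wrapped in a required spec envelope (POST /v2/evaluation/jobs)
v2_body = {"spec": {"target": target, "config": config}}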


v1 (Current)#

job = client.evaluation.jobs.create(
    namespace="my-organization",
    target={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": f"{NIM_BASE_URL}/v1/completions",
                "model_id": "meta/llama-3.1-8b-instruct",
                "format": "openai"
            }
        }
    },
    config={
        "type": "gsm8k",
        "params": {
            "temperature": 0.00001,      
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "parallelism": 5,
            "max_retries": 10,
            "request_timeout": 30,
            "extra": {
                "num_fewshot": 8,
                "batch_size": 16,
                "bootstrap_iters": 100000,
                "dataset_seed": 42,
                "use_greedy": True,
                "top_k": 1,
                "apply_chat_template": True,
                "fewshot_as_multiturn": True,
                "hf_token": HF_TOKEN,
                "tokenizer_backend": "hf",
                "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
            }
        }
    }
)

Jobs are uniquely identified by an id. You can view additional details of the evaluation job, such as its status:

print(f"Job ID: {job.id}")
print(f"Job Status: {job.status}")

This should print out something like:

Job ID: eval-K65X29YxKKji4tBzqmF3qT
Job Status: created
curl -X "POST" "${EVALUATOR_BASE_URL}/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "namespace": "my-organization",
        "target": {
            "type": "model",
            "model": {
                "api_endpoint": {
                    "url": "'$NIM_BASE_URL'/v1/completions",
                    "model_id": "meta/llama-3.1-8b-instruct",
                    "format": "openai"
                }
            }
        },
        "config": {
            "type": "gsm8k",
            "params": {
                "temperature": 0.00001,      
                "top_p": 0.00001,
                "max_tokens": 256,
                "stop": ["<|eot|>"],
                "parallelism": 5,
                "max_retries": 10,
                "request_timeout": 30,
                "extra": {
                    "num_fewshot": 8,
                    "batch_size": 16,
                    "bootstrap_iters": 100000,
                    "dataset_seed": 42,
                    "use_greedy": true,
                    "top_k": 1,
                    "apply_chat_template": true,
                    "fewshot_as_multiturn": true,
                    "hf_token": "'$HF_TOKEN'",
                    "tokenizer_backend": "hf",
                    "tokenizer": "meta-llama/Llama-3.1-8B-Instruct"
                }
            }
        }
    }'

Jobs are uniquely identified by an id, which you can see in the JSON response. It can be helpful to store the job ID in a variable for use in the subsequent steps:

export JOB_ID=<id returned in the JSON response>

For more information about the response format, refer to v2 (Preview).

Note

Chat Templates

When you use chat templates for evaluation (apply_chat_template: true in the evaluation config), also do the following, as illustrated in the sketch after this list:

  1. Use the correct endpoint (/v1/completions or /v1/chat/completions) in the target configuration for the API endpoint URL of the model.

  2. Set format: "openai" in the target configuration for the model API endpoint.
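
A minimal target sketch reflecting both points; it reuses the completions endpoint and model ID from this tutorial, so swap in /v1/chat/completions if that is the route your model serves:

# Target configuration for a chat-template evaluation
target = {
    "type": "model",
    "model": {
        "api_endpoint": {
            # pick the endpoint path that matches how your model is served
            "url": f"{NIM_BASE_URL}/v1/completions",
            "model_id": "meta/llama-3.1-8b-instruct",
            "format": "openai",  # required format for the model API endpoint
        }
    }
}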

For more details on LM Eval configurations, see LM Evaluation Harness.

Tip

If you want to evaluate a model’s ability to show step-by-step reasoning, make sure the model was fine-tuned or configured with detailed thinking on in the system message. For direct answers, use detailed thinking off. Refer to Reasoning Considerations for details on how to set this during dataset preparation.

2. Get the Status of Your Evaluation Job#

To get the status of the evaluation job that you submitted in the previous step, use the following code. The evaluation should take about 5 minutes to complete.

Note

For non-custom evaluations like LM Evaluation Harness, the progress percentage in the API may not update regularly. However, as long as the status is running, the evaluation is proceeding. The time to complete depends on factors like model performance and dataset size.

v2 (Preview)#

In v2, status information is consolidated into the main job details response.

# v2 - Get all job details including status in one call
job_details = client.v2.evaluation.jobs.retrieve(job.id)
print(f"Job status: {job_details.status}")
print(f"Status details: {job_details.status_details}")
if job_details.status_details and 'progress' in job_details.status_details:
    print(f"Progress: {job_details.status_details['progress']}%")
curl -X "GET" "${EVALUATOR_BASE_URL}/v2/evaluation/jobs/${JOB_ID}" \
  -H 'accept: application/json'

v1 (Current)#

job_status = client.evaluation.jobs.status(job.id)
print(f"Job status: {job_status.message}")
print(f"Progress: {job_status.progress}%")

Output should be similar to:

Job status: Job is running.
Progress: 0.0%
curl -X "GET" "${EVALUATOR_BASE_URL}/v1/evaluation/jobs/${JOB_ID}/status" \
  -H 'accept: application/json'

You receive a response similar to the following, which contains the status of each task and the overall progress percentage. For more information, refer to Get Evaluation Job Status.

{
  "message": "completed",
  "task_status": {},
  "progress": 100.0,
  "samples_processed": 1329
}
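
Because the evaluation can take several minutes, you may prefer to poll until it finishes rather than checking manually. The following is a minimal sketch using the v1 status call shown above; it assumes the message field reads "completed" on success, as in the example response:

import time

# Poll the v1 status endpoint until the job reports completion.
# Also watch for whatever error states your deployment reports.
while True:
    job_status = client.evaluation.jobs.status(job.id)
    print(f"Job status: {job_status.message} ({job_status.progress}%)")
    if job_status.message == "completed" or job_status.progress == 100.0:
        break
    time.sleep(30)  # wait 30 seconds between checks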

3. View Evaluation Job Results#

Once the job has completed successfully, you can examine the results of the evaluation:

v2 (Preview)#

# v2 - Get structured evaluation results
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job.id)
print(results.model_dump_json(indent=2, exclude_none=True))

# v2 - List available result types
available_results = client.v2.evaluation.jobs.results.list(job.id)
print(f"Available results: {[r.result_name for r in available_results.data]}")
# Get structured evaluation results
curl -X "GET" "${EVALUATOR_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
  -H 'accept: application/json'

# List available result types
curl -X "GET" "${EVALUATOR_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results" \
  -H 'accept: application/json'

v1 (Current)#

results = client.evaluation.jobs.results(job.id)
print(f"Result ID: {results.id}")
print(f"Job ID: {results.job}")
print(f"Tasks: {results.tasks}")
print(f"Groups: {results.groups}")
curl -X "GET" "${EVALUATOR_BASE_URL}/v1/evaluation/jobs/${JOB_ID}/results" \
  -H 'accept: application/json'
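
If you want to keep a copy of the raw results for later comparison, you can serialize the response object to a file. This is a minimal sketch that assumes the v1 results object is a Pydantic model exposing model_dump_json, as the v2 objects used earlier do:

# Write the full results payload to a local JSON file for later inspection
with open(f"{job.id}-results.json", "w") as f:
    f.write(results.model_dump_json(indent=2, exclude_none=True))

print(f"Saved results to {job.id}-results.json")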

4. Cleanup#

To delete the job created in this tutorial, execute the following:

v2 (Preview)#

# v2 - Delete the evaluation job
client.v2.evaluation.jobs.delete(job.id)
print("Job deleted successfully")
curl -X "DELETE" "${EVALUATOR_BASE_URL}/v2/evaluation/jobs/${JOB_ID}" \
  -H 'accept: application/json'

v1 (Current)#

client.evaluation.jobs.delete(job.id)
curl -X "DELETE" "${EVALUATOR_BASE_URL}/v1/evaluation/jobs/${JOB_ID}" \
  -H 'accept: application/json'

Next Steps#