Download this tutorial as a Jupyter notebook

Run a Benchmark Evaluation#

Learn how to perform an evaluation job with an industry benchmark in approximately 15 minutes.

Prerequisites#

  1. Set up the NeMo Platform Quickstart.

  2. A read access token for your Hugging Face account to access the benchmark dataset.

  3. A build.nvidia.com API key for inference with a hosted model for evaluation.

Tip: Cleanup cells at the end of the notebook can be uncommented to delete resources if needed.

Overview#

Use case: evaluate math capabilities of a large language model.

Objectives#

By the end of this notebook, you will:

  • Discover industry benchmarks.

  • Run an evaluation job with an industry benchmark.

  • Evaluate a model on solving grade school math word problems with the GSM8K dataset.

  • View evaluation results.

  • Download job artifacts.

# Install required packages
%pip install -q nemo-platform ipywidgets

Initialize the NeMo Platform SDK#

# Imports
import os
import re
import time

from nemo_platform import NeMoPlatform, ConflictError
from nemo_platform.types.evaluation import EvaluationJobParamsParam, Model, SystemBenchmarkOnlineJobParam

# Set variables needed for the tutorial
WORKSPACE = "my-workspace"

# Initialize the SDK client
NMP_BASE_URL = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
client = NeMoPlatform(
    base_url=NMP_BASE_URL,
    workspace=WORKSPACE,
)

Create a workspace to manage your secrets and evaluation jobs.

try:
    client.workspaces.create(name=WORKSPACE)
except ConflictError:
    print(f"Workspace '{WORKSPACE}' already exists, continuing...")

Create Secrets#

In this tutorial, we will use the following:

  • A model to evaluate, hosted on build.nvidia.com.

  • A Hugging Face tokenizer for the evaluated model.

You will need to configure authentication for both build.nvidia.com and Hugging Face. Get your credentials from each service, then configure the API keys as secrets in the NeMo Platform. For more detailed instructions, follow Managing Secrets.

Quick Secrets Setup#

# Export the NVIDIA_API_KEY and HF_TOKEN environment variables if they are not already set.
# You can also set them here with os.environ['NVIDIA_API_KEY'] = 'sk-xxx',
# but remove the hard-coded value afterward to avoid exposing your credentials.
secrets_to_sync = [
    ("nvidia_api_key", "NVIDIA_API_KEY"),
    ("hf-token", "HF_TOKEN"),
]

secret_refs = {}

for name, env_var in secrets_to_sync:
    value = os.getenv(env_var)
    if not value:
        raise ValueError(f"{env_var} is not set")

    try:
        secret = client.secrets.create(name=name, workspace=WORKSPACE, data=value)
        print(f"Created secret: {name}")
    except ConflictError:
        print(f"Secret '{name}' already exists, retrieving...")
        secret = client.secrets.retrieve(name=name, workspace=WORKSPACE)
        print(f"Retrieved existing secret: {name}.")

    secret_refs[name] = secret

nvidia_api_key_secret = secret_refs["nvidia_api_key"]
print(f"NVIDIA_API_KEY secret reference: {nvidia_api_key_secret.workspace}/{nvidia_api_key_secret.name}")

hf_secret = secret_refs["hf-token"]
print(f"HF_TOKEN secret reference: {hf_secret.workspace}/{hf_secret.name}")

Discover Evaluation Benchmarks#

NeMo Platform provides ready-to-use evaluation benchmarks in the reserved system workspace. They evaluate models against published datasets using a set of pre-defined metrics.

Discover all the industry benchmarks.

all_industry_benchmarks = client.evaluation.benchmarks.list(workspace="system", page_size=200)
print("Number of available industry benchmarks:", all_industry_benchmarks.pagination.total_results)

print("Example benchmark:")
print(all_industry_benchmarks.data[0].model_dump_json(indent=2, exclude_none=True))
Output

Number of available industry benchmarks: 131

Example benchmark:

{
  "id": "",
  "entity_id": "",
  "workspace": "system",
  "description": "BFCL v3 simple single-turn function calling. Tests basic function call generation.",
  "labels": {
    "eval_harness": "bfcl",
    "eval_category": "agentic"
  },
  "name": "bfclv3-simple",
  "required_params": [],
  "optional_params": [],
  "supported_job_types": [
    "online"
  ]
}

You can filter industry benchmarks by labels like math or advanced_reasoning. View the benchmark descriptions to choose the one that suits your evaluation needs.

filtered_industry_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "math"},
)
print("Filtered industry benchmarks:", filtered_industry_benchmarks.pagination.total_results)

for benchmark in filtered_industry_benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
Output
Filtered industry benchmarks: 12
math-test-500-nemo: math_test_500 questions, math, using NeMo's alignment template
aime-2025-nemo: AIME 2025 questions, math, using NeMo's alignment template
mgsm-cot: MGSM-CoT: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.
gsm8k-cot-instruct: GSM8K-instruct: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.
gsm8k: GSM8K: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.
mgsm: MGSM: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.
aa-aime-2024: AIME 2024 questions, math, using Artificial Analysis's setup.
aa-math-test-500: Open AI math test 500, using Artificial Analysis's setup.
aime-2024: AIME 2024 questions, math
aime-2025: AIME 2025 questions, math
math-test-500: Open AI math test 500
aime-2024-nemo: AIME 2024 questions, math, using NeMo's alignment template

For this tutorial, we will evaluate with the gsm8k-cot-instruct benchmark, a GSM8K variant that evaluates the arithmetic reasoning of large language models on grade school math word problems using chain-of-thought prompting.

Inspect the benchmark for details on how to configure the job. This is the recommended pattern before creating any benchmark job because different benchmarks can require different parameters and secret references. You will see that GSM8K supports online evaluations and requires the parameter hf_token. Online evaluation involves live inference calls to a model, whereas an offline evaluation expects a dataset representing pre-generated model outputs.

gsm8k_benchmark = client.evaluation.benchmarks.retrieve(workspace="system", name="gsm8k-cot-instruct")

print(gsm8k_benchmark.model_dump_json(indent=2, exclude_none=True))

Configure Model to Evaluate#

Evaluate any model such as one hosted from build.nvidia.com. We will configure the model with the URL for the hosted model and nvidia_api_key_secret secret created in the Create Secrets step above.

Note: The GSM8K benchmark requires the evaluated model’s tokenizer to be available on Hugging Face.

We will use the nvidia/llama-3.3-nemotron-super-49b-v1.5 model hosted on build.nvidia.com for this tutorial.

model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1.5",
    api_key_secret=nvidia_api_key_secret.name,
)

print(model.model_dump_json(indent=2, exclude_none=True))

Run Evaluation Job#

The GSM8K benchmark requires the hf_token parameter to access the dataset from Hugging Face. We will use the hf_secret created in the Create Secrets step above.

gsm8k_benchmark.required_params
Output
[{"name": "hf_token",
  "type": "secret",
  "description": "HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace."}]
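The required_params output above is a list of name/type/description records. As a minimal sketch (plain Python, not part of the nemo-platform SDK), this is how you might check a job configuration against such records before submitting:

```python
# Illustrative sketch only -- not part of the nemo-platform SDK.
# Returns the names of required parameters absent from the supplied params.

def missing_required_params(required_params, benchmark_params):
    """List required parameter names that benchmark_params does not provide."""
    return [p["name"] for p in required_params if p["name"] not in benchmark_params]

# Example shaped like the gsm8k-cot-instruct output above.
required = [{"name": "hf_token", "type": "secret",
             "description": "HuggingFace token for accessing datasets and tokenizers."}]

print(missing_required_params(required, {}))                        # ['hf_token']
print(missing_required_params(required, {"hf_token": "hf-token"}))  # []
```

Running this check locally surfaces a missing secret reference before the job is created, rather than after it fails on the platform.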

For this tutorial, we limit the job to only run on 15 samples from the benchmark dataset. Remove params.limit_samples to run the full evaluation.

Note: Parallelism controls the number of concurrent requests to the model during evaluation and can improve the job runtime. Parallelism is set to 1 for this tutorial because https://integrate.api.nvidia.com rate-limits requests to one at a time.
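As a rough back-of-the-envelope, wall-clock time scales with the sample count divided by parallelism. A small sketch (illustrative only; the per-sample latency is an assumed figure, not a measurement of the hosted endpoint):

```python
# Rough runtime model: ceil(samples / parallelism) batches, one average
# per-sample latency each. avg_seconds_per_sample is an assumed value.
import math

def estimated_minutes(samples: int, parallelism: int, avg_seconds_per_sample: float) -> float:
    batches = math.ceil(samples / parallelism)
    return batches * avg_seconds_per_sample / 60

# 15 tutorial samples at parallelism 1, assuming ~20 s per sample.
print(round(estimated_minutes(15, 1, 20.0), 1))         # 5.0 minutes

# Full GSM8K (1,319 samples) at parallelism 1, same assumption.
print(round(estimated_minutes(1319, 1, 20.0) / 60, 1))  # ~7.3 hours
```

This is why the tutorial limits the job to 15 samples, and why the full run in Next Steps is quoted in hours.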

limit_samples = 15

job = client.evaluation.benchmark_jobs.create(
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k-cot-instruct",
        benchmark_params={
            "hf_token": hf_secret.name,
        },
        params=EvaluationJobParamsParam(limit_samples=limit_samples, parallelism=1),
        model=model,
    )
)

Monitor the job status until it completes:

job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while job_status.status in ("active", "pending", "created"):
    time.sleep(10)
    job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
    status_details = job_status.status_details
    samples_processed = status_details.samples_processed if status_details else None

    if samples_processed is not None:
        print(f"status: {job_status.status} ({samples_processed}/{limit_samples} samples processed)")
    else:
        print("status:", job_status.status, status_details)
print(job_status.model_dump_json(indent=2, exclude_none=True))

View Evaluation Results#

Evaluation results are available once the evaluation job successfully completes.

GSM8K scores fall in the range [0.0, 1.0] and measure the proportion of correct answers.

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    job=job.name,
)
print(aggregate.to_json(indent=2))  # Returns typed aggregate score statistics
Output
{
  "scores": [
    {
      "name": "exact_match__flexible-extract",
      "score_type": "range",
      "count": 15,
      "nan_count": 0,
      "sum": 15.0,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0,
      "std_dev": 0.0,
      "variance": 0.0
    },
    {
      "name": "exact_match__strict-match",
      "score_type": "range",
      "count": 15,
      "nan_count": 0,
      "sum": 15.0,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0,
      "std_dev": 0.0,
      "variance": 0.0
    }
  ]
}
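The aggregate statistics above can be reproduced from per-sample scores. A minimal sketch (plain Python, independent of the SDK) showing how count, sum, mean, std_dev, and variance relate for a hypothetical list of 0/1 exact-match scores:

```python
# Illustrative: recompute the summary statistics the aggregate endpoint
# reports, from a hypothetical list of per-sample exact-match scores.
from statistics import mean, pstdev, pvariance

def summarize(scores):
    return {
        "count": len(scores),
        "sum": sum(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "std_dev": pstdev(scores),    # population std dev, paired with
        "variance": pvariance(scores),  # population variance
    }

# 15 samples, all correct -- matches the tutorial output above.
print(summarize([1.0] * 15))
```

With every sample correct, the mean is 1.0 and the spread statistics are 0.0, exactly as in the aggregate output shown above.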

View Job Artifacts#

View the job logs for more insight into the evaluation.

# Remove ANSI control characters so logs render cleanly in notebooks.
ansi_escape = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")

def strip_ansi(text: str) -> str:
    return ansi_escape.sub("", text)

logs_response = client.evaluation.benchmark_jobs.get_logs(name=job.name)
for log_entry in logs_response.data:
    print(f"[{log_entry.timestamp}] {strip_ansi(log_entry.message).strip()}")

# Handle pagination
while logs_response.next_page:
    logs_response = client.evaluation.benchmark_jobs.get_logs(
        name=job.name,
        page_cursor=logs_response.next_page
    )
    for log_entry in logs_response.data:
        print(f"[{log_entry.timestamp}] {strip_ansi(log_entry.message).strip()}")

Download the artifacts the job produced during evaluation as a tarball.

artifacts = client.evaluation.benchmark_jobs.results.artifacts.download(job=job.name)
artifacts.write_to_file("evaluation_artifacts.tar.gz")
print("Saved artifacts to evaluation_artifacts.tar.gz")

Extract the files from the tarball with the following command; an artifacts directory will be created.

tar -xf evaluation_artifacts.tar.gz
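If you prefer to stay in the notebook, a sketch of the same extraction using Python's stdlib tarfile module (the archive name matches the file saved above):

```python
# Extract a .tar.gz artifacts archive without shelling out to tar.
import tarfile

def extract_artifacts(archive_path: str, dest_dir: str = ".") -> None:
    """Extract a gzipped tar archive into dest_dir."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        try:
            # The "data" filter guards against path-traversal entries
            # (available on recent Python versions).
            tar.extractall(path=dest_dir, filter="data")
        except TypeError:
            # Older Python without the filter parameter.
            tar.extractall(path=dest_dir)

# extract_artifacts("evaluation_artifacts.tar.gz")
```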

Next Steps#

Scale Up:

  • Run the full evaluation by omitting the job parameter limit_samples. The full evaluation can take 5-10 hours.

Apply to Your Domain:

  • Search through available benchmarks and run an evaluation job with another industry benchmark.

  • Evaluate another model hosted on build.nvidia.com or another service, or host your own NIM.

Cleanup#

Uncomment cleanup cells as needed to delete resources.

# # Delete evaluation jobs (PERMANENT)
# print("Deleting evaluation jobs...")
# client.evaluation.benchmark_jobs.delete(job.name)
# print(f"Deleted evaluation job {job.name}")
# # Delete secrets
# print("Deleting secrets...")
# client.secrets.delete(nvidia_api_key_secret)
# client.secrets.delete(hf_secret)