Simple Evaluations#

Simple Evaluations are a collection of benchmark evaluation types for language models, including various math benchmarks, GPQA variants, and MMLU in over 30 languages.

Tip

Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.

Target Configuration#

All Simple-Evals evaluations require a chat endpoint in the target configuration, for example:

{
  "target": {
    "type": "model", 
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}
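
Before you configure the target, it can help to confirm that the endpoint actually serves chat completions. The snippet below is a minimal sketch using the requests library against an OpenAI-compatible /v1/chat/completions endpoint; the base URL and model ID are placeholders for your own deployment and are not part of the configuration above.

import requests

# Placeholders: substitute the base URL and model ID of your NIM deployment.
NIM_BASE_URL = "https://<nim-base-url>"
MODEL_ID = "meta/llama-3.3-70b-instruct"

response = requests.post(
    f"{NIM_BASE_URL}/v1/chat/completions",
    json={
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 8,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])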

Example Job Execution#

You can execute an evaluation job using either the Python SDK or cURL as follows, replacing <my-eval-config> with one of the configs shown on this page:

Note

See Job Target and Configuration Matrix for details on target / config compatibility.

Python SDK (v2 API):

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

# Replace with the base URL of your NIM deployment.
NIM_BASE_URL = "https://<nim-base-url>"
job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    "url": f"{NIM_BASE_URL}/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            },
        },
        "config": <my-eval-config>
    }
)
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "spec": {
            "target": {
                "type": "model",
                "name": "my-target-dataset-1",
                "namespace": "my-organization",
                "model": {
                    "api_endpoint": {
                        "url": "https://<nim-base-url>/v1/chat/completions",
                        "model_id": "meta/llama-3.1-8b-instruct"
                    }
                }
            },
            "config": <my-eval-config>
        }
    }'

Python SDK (v1 API):

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

# Replace with the base URL of your NIM deployment.
NIM_BASE_URL = "https://<nim-base-url>"
job = client.evaluation.jobs.create(
    namespace="my-organization",
    target={
        "type": "model",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                "url": f"{NIM_BASE_URL}/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        },
    },
    config=<my-eval-config>
)
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "namespace": "my-organization",
        "target": {
            "type": "model",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    "url": "https://<nim-base-url>/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            }
        },
        "config": <my-eval-config>
    }'
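
After you submit a job, you typically poll it until it reaches a terminal state. The loop below is a hedged sketch against the Python SDK: the jobs.retrieve call, the id attribute, and the status values are assumptions based on common SDK conventions, so check the NeMo Microservices SDK reference for the exact names in your version.

import time

# Assumption: the SDK exposes an evaluation jobs retrieve call and the job
# object carries id/status fields; verify against the SDK reference.
while True:
    current = client.evaluation.jobs.retrieve(job.id)
    print(f"Job {job.id} status: {current.status}")
    if current.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)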

For a full example, see Run an Academic LM Harness Eval.


Supported Tasks#

Simple Evaluation Tasks by Category#

| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | gpqa_diamond, gpqa_experts, gpqa_extended, gpqa_main | Graduate-level Q&A tasks. |
| Language Understanding in Multiple Languages | mmlu_am, mmlu_ar, …, mmlu_yo (see below for the full list) | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | AIME_2025, AIME_2024, AA_AIME_2024, AA_math_test_500, math_test_500, simpleqa | Math word problems. SimpleQA has shorter fact-seeking questions. |
| Benchmarks Using NeMo’s Alignment Template | aime_2025_nemo, aime_2024_nemo, math_test_500_nemo, gpqa_diamond_nemo | Same benchmarks as above but using NeMo’s alignment template format. |


Advanced Reasoning (GPQA)#

Simple-Evals includes several GPQA variants. All of them require the extra parameter hf_token so that the evaluation can access the gated Idavidrein/gpqa dataset on Hugging Face. The blocks below show an example task configuration, an example evaluated sample, and example results.

{
    "type": "gpqa_diamond",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5,
            "hf_token": "hf_XXXXXX"
        }
    }
}
{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}
{
  "tasks": {
    "gpqa_diamond": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.24,
              "stats": {
                "stddev": 0.427083130081253,
                "stderr": 0.0610118757258932
              }
            }
          }
        }
      }
    }
  }
}
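
Rather than hard-coding the token, you can build the config programmatically and read hf_token from the environment. A minimal sketch, assuming your token is exported in an HF_TOKEN environment variable (a name chosen here for illustration):

import os

# Build the gpqa_diamond config shown above, reading the Hugging Face token
# from the HF_TOKEN environment variable instead of hard-coding it.
gpqa_config = {
    "type": "gpqa_diamond",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5,
            "hf_token": os.environ["HF_TOKEN"],
        },
    },
}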

Language Understanding (MMLU)#

Simple-Evals offers the MMLU benchmark in a variety of languages. The available tasks are listed below, followed by an example task configuration and example results.

MMLU Tasks by Language#

| Task | Language |
|---|---|
| mmlu_am | Amharic |
| mmlu_ar | Arabic |
| mmlu_bn | Bengali |
| mmlu_cs | Czech |
| mmlu_de | German |
| mmlu_el | Greek |
| mmlu_en | English (US) |
| mmlu_es | Spanish (LA) |
| mmlu_fa | Farsi |
| mmlu_fil | Filipino |
| mmlu_fr | French |
| mmlu_ha | Hausa |
| mmlu_he | Hebrew |
| mmlu_hi | Hindi |
| mmlu_id | Bahasa Indonesia |
| mmlu_ig | Igbo |
| mmlu_it | Italian |
| mmlu_ja | Japanese |
| mmlu_ko | Korean |
| mmlu_ky | Kyrgyz |
| mmlu_lt | Lithuanian |
| mmlu_mg | Malagasy |
| mmlu_ms | Malay |
| mmlu_ne | Nepali |
| mmlu_nl | Dutch |
| mmlu_ny | Chichewa (also known as Chewa or Nyanja) |
| mmlu_pl | Polish |
| mmlu_pt | Portuguese (BR) |
| mmlu_ro | Romanian |
| mmlu_ru | Russian |
| mmlu_si | Sinhala |
| mmlu_sn | Shona |
| mmlu_so | Somali |
| mmlu_sr | Serbian |
| mmlu_sv | Swedish |
| mmlu_sw | Swahili |
| mmlu_te | Telugu |
| mmlu_tr | Turkish |
| mmlu_uk | Ukrainian |
| mmlu_vi | Vietnamese |
| mmlu_yo | Yoruba |

{
  "type": "mmlu_am",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "top_p": 0.00001,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5
    }
  }
}

{
  "tasks": {
    "stem": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.1,
              "stats": {
                "stderr": 0.0428571428571428
              }
            }
          }
        }
      }
    }
  }
}
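
To run the same evaluation across several languages, you can generate one config per task. The sketch below reuses the parameters from the example above; the particular task names are just a sample drawn from the table earlier in this section.

# Generate one MMLU config per language, reusing the parameters shown above.
languages = ["mmlu_de", "mmlu_fr", "mmlu_ja", "mmlu_sw"]

mmlu_configs = [
    {
        "type": task,
        "params": {
            "limit_samples": 50,
            "parallelism": 50,
            "request_timeout": 300,
            "top_p": 0.00001,
            "extra": {"model_type": "chat", "num_fewshot": 5},
        },
    }
    for task in languages
]

Each entry in mmlu_configs can then be submitted as the config of its own evaluation job.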

Math & Reasoning (Math Test 500)#

{
  "type": "math_test_500",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "judge": {
        "model": {
          "api_endpoint": {
            "model_id": "meta/llama-3.3-70b-instruct",
            "url": "https://<nim-base-url>/v1/chat/completions"
          }
        }
      }
    }
  }
}

The judge model should have at least 70B parameters; otherwise, metric computation might fail because the judge output does not match the specified metric template. See Troubleshooting Unsupported Judge Model for more details.


{
  "tasks": {
    "gpqa_diamond_cot_zeroshot": {
      "metrics": {
        "exact_match__flexible-extract": {
          "scores": {
            "exact_match__flexible-extract": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Benchmarks Using NeMo Alignment Template (Math Test 500 - NeMo)#

{
    "type": "math_test_500_nemo",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5
        }
    }
}

{
  "tasks": {
    "math_test_500_nemo": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.32,
              "stats": {
                "stddev": 0.466476151587624,
                "stderr": 0.0666394502268034
              }
            }
          }
        }
      }
    }
  }
}

Parameters#

Request Parameters#

These parameters control how requests are made to the model:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| max_retries | Maximum number of retries for failed requests. | Integer | Container default | |
| parallelism | Number of parallel requests to improve throughput. | Integer | Container default | |
| request_timeout | Timeout in seconds for each request. | Integer | Container default | |
| limit_samples | Limit the number of samples to evaluate. Useful for testing. | Integer | null (all samples) | |

Model Parameters#

These parameters control the model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| temperature | Sampling temperature for generation. | Float | Container default | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.00001 | 0.01–1.0 |
| max_new_tokens | Maximum number of output sequence tokens. | Integer | 4096 | |

Extra Parameters#

Set these parameters in the params.extra section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| model_type | Type of model interface to use. Required by the underlying container; Evaluator attempts to infer it if not provided. | String | Auto-detected | "chat" for all Simple-Evals benchmarks |
| hf_token | Hugging Face token for accessing datasets and tokenizers. Required for tasks that fetch from Hugging Face. | String | | Valid HF token |
| num_fewshot | Number of examples in the few-shot context. | Integer | Task-dependent | |
| downsampling_ratio | Ratio for downsampling the dataset. | Float | null (no downsampling) | 0.0–1.0 |
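
Putting the request, model, and extra parameters together, a task configuration might look like the sketch below. It assumes that the request and model parameters sit directly under params, as top_p does in the MMLU example above; the values are illustrations, not recommendations.

# Illustrative combination of request, model, and extra parameters from the
# tables above. Values are examples only.
example_config = {
    "type": "mmlu_en",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "max_retries": 3,
        "temperature": 0.0,
        "top_p": 0.00001,
        "max_new_tokens": 4096,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5,
            "downsampling_ratio": 0.5,
        },
    },
}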

Extra Judge Parameters#

Set these parameters in the params.extra.judge.model section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| api_endpoint.url | URL of the judge model. | String | - | - |
| api_endpoint.model_id | ID of the judge model. | String | - | - |
| api_endpoint.api_key | API key used to authenticate with the judge model. | String | - | - |
| api_endpoint.format | "openai" for OpenAI-compatible judges; "nim" for direct calls via aiohttp. | String | "nim" | "openai", "nim" |
| prompt.inference_params.temperature | Sampling temperature for generation. | Float | 0.0 | - |
| prompt.inference_params.top_p | Nucleus sampling parameter. | Float | 0.00001 | - |
| prompt.inference_params.max_tokens | Maximum number of output sequence tokens. | Integer | 1024 | - |
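
A fuller judge block than the one in the Math Test 500 example might look like the sketch below, combining the endpoint, authentication, format, and inference parameters from this table. The nesting is an assumption derived from the dotted names and the params.extra.judge.model location stated above, so verify it against your Evaluator version; JUDGE_API_KEY is a hypothetical environment variable name.

import os

# Sketch of a judge block for params.extra, based on the Extra Judge
# Parameters table. Field nesting is assumed from the dotted names.
judge = {
    "model": {
        "api_endpoint": {
            "url": "https://<nim-base-url>/v1/chat/completions",
            "model_id": "meta/llama-3.3-70b-instruct",
            "api_key": os.environ.get("JUDGE_API_KEY"),  # hypothetical env var
            "format": "openai",
        },
        "prompt": {
            "inference_params": {
                "temperature": 0.0,
                "top_p": 0.00001,
                "max_tokens": 1024,
            }
        },
    }
}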