# Simple Evaluations

Simple Evaluations are a collection of benchmark evaluation types for language models, including various math benchmarks, GPQA variants, and MMLU in over 30 languages.
> **Tip:** Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.
## Target Configuration

All Simple-Evals evaluations require a chat endpoint configuration:
```json
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}
```
## Example Job Execution

You can execute an evaluation job using either the Python SDK or cURL, as shown below. Replace `<my-eval-config>` with one of the configs shown on this page, and replace the NIM base URL placeholders (`NIM_BASE_URL` in the Python examples, `<nim-base-url>` in the cURL examples) with your NIM deployment's endpoint.

> **Note:** See Job Target and Configuration Matrix for details on target/config compatibility.
**Python SDK (v2 API):**

```python
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    # Replace NIM_BASE_URL with your specific deployment
                    "url": f"{NIM_BASE_URL}/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            },
        },
        "config": <my-eval-config>
    }
)
```
**cURL (v2 API):**

```bash
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec": {
      "target": {
        "type": "model",
        "name": "my-target-dataset-1",
        "namespace": "my-organization",
        "model": {
          "api_endpoint": {
            "url": "https://<nim-base-url>/v1/chat/completions",
            "model_id": "meta/llama-3.1-8b-instruct"
          }
        }
      },
      "config": <my-eval-config>
    }
  }'
```
**Python SDK (v1 API):**

```python
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

job = client.evaluation.jobs.create(
    namespace="my-organization",
    target={
        "type": "model",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                # Replace NIM_BASE_URL with your specific deployment
                "url": f"{NIM_BASE_URL}/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        },
    },
    config=<my-eval-config>
)
```
**cURL (v1 API):**

```bash
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "namespace": "my-organization",
    "target": {
      "type": "model",
      "namespace": "my-organization",
      "model": {
        "api_endpoint": {
          "url": "https://<nim-base-url>/v1/chat/completions",
          "model_id": "meta/llama-3.1-8b-instruct"
        }
      }
    },
    "config": <my-eval-config>
  }'
```
For a full example, see Run an Academic LM Harness Eval.
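After you create a job, you can check its progress. The sketch below assumes the job resource supports a standard GET under the same path used for creation (shown here for the v2 API), with `<job-id>` taken from the create response; confirm the exact endpoint in your deployment's API reference.

```bash
# Minimal sketch: poll the status of the job created above.
# Assumes a GET endpoint at /v2/evaluation/jobs/<job-id>; verify the
# exact path against your Evaluator API reference.
export JOB_ID="<job-id-from-the-create-response>"

curl -s "$EVALUATOR_BASE_URL/v2/evaluation/jobs/$JOB_ID" \
  -H 'accept: application/json'
```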
## Supported Tasks

| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | `gpqa_diamond` | Graduate-level Q&A tasks. |
| Language Understanding in Multiple Languages | `mmlu_am` and other per-language MMLU variants | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | `math_test_500`, `simpleqa` | Math word problems. SimpleQA has shorter fact-seeking questions. |
| Benchmarks Using NeMo’s Alignment Template | `math_test_500_nemo` | Same benchmarks as above but using NeMo’s alignment template format. |
## Advanced Reasoning (GPQA)

Simple-Evals includes several GPQA variants. For all of them, include the extra parameter `hf_token` so the evaluation can access the gated dataset `Idavidrein/gpqa`.

Example configuration:
```json
{
  "type": "gpqa_diamond",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "hf_token": "hf_XXXXXX"
    }
  }
}
```
Example record:

```json
{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}
```
Example results:

```json
{
  "tasks": {
    "gpqa_diamond": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.24,
              "stats": {
                "stddev": 0.427083130081253,
                "stderr": 0.0610118757258932
              }
            }
          }
        }
      }
    }
  }
}
```
## Language Understanding (MMLU)

Simple-Evals offers the MMLU benchmark in a variety of languages; the configuration example below uses `mmlu_am` (Amharic).
| Task | Language |
|---|---|
|  | Amharic |
|  | Arabic |
|  | Bengali |
|  | Czech |
|  | German |
|  | Greek |
|  | English (US) |
|  | Spanish (LA) |
|  | Farsi |
|  | Filipino |
|  | French |
|  | Hausa |
|  | Hebrew |
|  | Hindi |
|  | Bahasa Indonesia |
|  | Igbo |
|  | Italian |
|  | Japanese |
|  | Korean |
|  | Kyrgyz |
|  | Lithuanian |
|  | Malagasy |
|  | Malay |
|  | Nepali |
|  | Dutch |
|  | Chichewa (also known as Chewa or Nyanja) |
|  | Polish |
|  | Portuguese (BR) |
|  | Romanian |
|  | Russian |
|  | Sinhala |
|  | Shona |
|  | Somali |
|  | Serbian |
|  | Swedish |
|  | Swahili |
|  | Telugu |
|  | Turkish |
|  | Ukrainian |
|  | Vietnamese |
|  | Yoruba |
Example configuration:

```json
{
  "type": "mmlu_am",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "top_p": 0.00001,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5
    }
  }
}
```
Example results:

```json
{
  "tasks": {
    "stem": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.1,
              "stats": {
                "stderr": 0.0428571428571428
              }
            }
          }
        }
      }
    }
  }
}
```
## Math & Reasoning (Math Test 500)

The `math_test_500` task uses an LLM judge to grade model answers; configure the judge endpoint under `params.extra.judge`:
```json
{
  "type": "math_test_500",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "judge": {
        "model": {
          "api_endpoint": {
            "model_id": "meta/llama-3.3-70b-instruct",
            "url": "https://<nim-base-url>/v1/chat/completions"
          }
        }
      }
    }
  }
}
```
The judge model should have at least 70B parameters; otherwise, metric evaluation might fail because the judge output does not match the expected metric template. See Troubleshooting Unsupported Judge Model for more details.
Example results:

```json
{
  "tasks": {
    "gpqa_diamond_cot_zeroshot": {
      "metrics": {
        "exact_match__flexible-extract": {
          "scores": {
            "exact_match__flexible-extract": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
```
## Benchmarks Using NeMo Alignment Template (Math Test 500 - NeMo)

Example configuration:
```json
{
  "type": "math_test_500_nemo",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5
    }
  }
}
```
Example results:

```json
{
  "tasks": {
    "math_test_500_nemo": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.32,
              "stats": {
                "stddev": 0.466476151587624,
                "stderr": 0.0666394502268034
              }
            }
          }
        }
      }
    }
  }
}
```
## Parameters

### Request Parameters

These parameters control how requests are made to the model:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `max_retries` | Maximum number of retries for failed requests. | Integer | Container default | — |
| `parallelism` | Number of parallel requests to improve throughput. | Integer | Container default | — |
| `request_timeout` | Timeout in seconds for each request. | Integer | Container default | — |
| `limit_samples` | Limit the number of samples to evaluate. Useful for testing. | Integer | — | — |
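For example, a quick smoke test can cap the number of samples while keeping a generous timeout. The values below are illustrative; `parallelism`, `request_timeout`, and `limit_samples` appear in the task configs earlier on this page, while `max_retries` is inferred from its description above.

```json
{
  "params": {
    "limit_samples": 10,
    "parallelism": 8,
    "request_timeout": 300,
    "max_retries": 3
  }
}
```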
### Model Parameters

These parameters control the model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `temperature` | Sampling temperature for generation. | Float | Container default | — |
| `top_p` | Nucleus sampling parameter. | Float | 0.00001 | — |
| `max_tokens` | Maximum number of output sequence tokens. | Integer | 4096 | — |
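A short sketch combining these generation settings. Only `top_p` appears in the task configs on this page; the `temperature` and `max_tokens` names are assumed from the Evaluator's common parameter schema, so verify them against your deployment.

```json
{
  "params": {
    "temperature": 0.0,
    "top_p": 0.00001,
    "max_tokens": 4096
  }
}
```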
### Extra Parameters

Set these parameters in the `params.extra` section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `model_type` | Type of model interface to use. Required by the underlying container; Evaluator will attempt to infer it if not provided. | String | Auto-detected | `chat`, `completions` |
| `hf_token` | Hugging Face token for accessing datasets and tokenizers. Required for tasks that fetch from Hugging Face. | String | — | Valid HF token |
| `num_fewshot` | Number of examples in the few-shot context. | Integer | Task-dependent | — |
| `downsampling_ratio` | Ratio for downsampling the dataset. | Float | — | — |
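These fields sit under `params.extra`, alongside the task configs shown earlier. In the sketch below, `model_type` and `num_fewshot` match those configs, while the `downsampling_ratio` name is inferred from its description above, so verify it against your Evaluator version.

```json
{
  "params": {
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "downsampling_ratio": 0.1
    }
  }
}
```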
### Extra Judge Parameters

Set these parameters in the `params.extra.judge.model` section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `api_endpoint.url` | URL of the judge model. | String | — | — |
| `api_endpoint.model_id` | ID of the judge model. | String | — | — |
| `api_endpoint.api_key` | API key used to authenticate with the judge model. | String | — | — |
|  |  | String |  |  |
| `temperature` | Sampling temperature for generation. | Float | 0.0 | — |
| `top_p` | Nucleus sampling parameter. | Float | 0.00001 | — |
| `max_tokens` | Maximum number of output sequence tokens. | Integer | 1024 | — |
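Putting the judge settings together, a `params.extra.judge` block might look like the following. The `url` and `model_id` fields follow the `math_test_500` config shown earlier; the `api_key` placement inside `api_endpoint` follows the table above, so confirm the exact nesting for your Evaluator version.

```json
{
  "params": {
    "extra": {
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "https://<nim-base-url>/v1/chat/completions",
            "model_id": "meta/llama-3.3-70b-instruct",
            "api_key": "<your-api-key>"
          }
        }
      }
    }
  }
}
```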