Simple Evaluations#

Simple Evaluations are collection of benchmark evaluation types for language models, including various math benchmarks, GPQA variants and MMLU in over 30 languages.

Tip

Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.

Target Configuration

All of the Simple-Evals evaluations require a chat endpoint configuration.

{
  "target": {
    "type": "model", 
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Example Job Execution#

You can execute an Evaluation Job using either the Python SDK or cURL as follows, replacing <my-eval-config> with configs shown on this page:

Note

See Job Target and Configuration Matrix for details on target / config compatibility.

Python SDK v2

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    # Replace NIM_BASE_URL with your specific deployment
                    "url": f"{NIM_BASE_URL}/v1/chat/completions", 
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            },
        },
        "config": <my-eval-config>
    }
)

cURL v2

curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "spec": {
            "target": {
                "type": "model",
                "name": "my-target-dataset-1",
                "namespace": "my-organization",
                "model": {
                    "api_endpoint": {
                        # Replace NIM_BASE_URL with your specific deployment
                        "url": f"{NIM_BASE_URL}/v1/chat/completions", 
                        "model_id": "meta/llama-3.1-8b-instruct"
                    }
                }
            },
            "config": <my-eval-config>
        }
    }'

Python SDK v1

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.evaluation.jobs.create(
    namespace="my-organization",
    target={
        "type": "model",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                # Replace NIM_BASE_URL with your specific deployment
                "url": f"{NIM_BASE_URL}/v1/chat/completions", 
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        },
    },
    config=<my-eval-config>
)

cURL v1

curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "namespace": "my-organization",
        "target": {
            "type": "model",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    # Replace NIM_BASE_URL with your specific deployment
                    "url": f"{NIM_BASE_URL}/v1/chat/completions", 
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            }
        },
        "config": <my-eval-config>
    }'

For a full example, see Run an Academic LM Harness Eval

Supported Tasks#

Simple Evaluation Tasks by Category#
Category	Example Task(s)	Description
Advanced Reasoning	`gpqa_diamond`, `gpqa_experts`, `gpqa_extended`, `gpqa_main`	Graduate-level Q&A tasks.
Language Understanding in Multiple Languages	`mmlu_am`, `mmlu_ar`, …, `mmlu_yo` (see below for full list)	Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more.
Math & Reasoning	`AIME_2025`, `AIME_2024`, `AA_AIME_2024`, `AA_math_test_500`, `math_test_500`, `simpleqa`	Math word problems. SimpleQA has shorter fact-seeking questions.
Benchmarks Using NeMo’s Alignment Template	`aime_2025_nemo`, `aime_2024_nemo`, `math_test_500_nemo`, `gpqa_diamond_nemo`	Same benchmarks as above but using NeMo’s alignment template format.

Advanced Reasoning (GPQA)#

Simple-Evals includes several GPQA variants. For all of these, include the extra parameter hf_token in order to access the gated dataset Idavidrein/gpqa.

Configuration

{
    "type": "gpqa_diamond",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5,
            "hf_token": "hf_XXXXXX"
        }
    }
}

Data Format

{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}

Result

{
  "tasks": {
    "gpqa_diamond": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.24,
              "stats": {
                "stddev": 0.427083130081253,
                "stderr": 0.0610118757258932
              }
            }
          }
        }
      }
    }
  }
}

Note

The GPQA evaluation in simple-eval uses a different protocol than the LM Evaluation Harness GPQA evaluation:

Simple-eval: Employs a zero-shot, chain-of-thought approach with straightforward instructions
LM Evaluation Harness: Uses a few-shot evaluation framework

Because of these methodological differences, the same model can produce significantly different accuracy scores. For this reason, GPQA (simple-eval) and GPQA (lm-eval-harness) are treated as distinct, non-comparable metrics.

Language Understanding (MMLU)#

Simple-Evals offers the MMLU benchmark in a variety of languages.

Multi-language Code Generation Tasks#
Task	Language
`mmlu_am`	Amharic
`mmlu_ar`	Arabic
`mmlu_bn`	Bengali
`mmlu_cs`	Czech
`mmlu_de`	German
`mmlu_el`	Greek
`mmlu_en`	English (US)
`mmlu_es`	Spanish (LA)
`mmlu_fa`	Farsi
`mmlu_fil`	Filipino
`mmlu_fr`	French
`mmlu_ha`	Hausa
`mmlu_he`	Hebrew
`mmlu_hi`	Hindi
`mmlu_id`	bahasa Indonesia
`mmlu_ig`	Igbo
`mmlu_it`	Italian
`mmlu_ja`	Japanese
`mmlu_ko`	Korean
`mmlu_ky`	Kyrgyz
`mmlu_lt`	Lithuanian
`mmlu_mg`	Malagasy
`mmlu_ms`	Malay
`mmlu_ne`	Nepali
`mmlu_nl`	Dutch
`mmlu_ny`	Chichewa, also known as Chewa or Nyanja
`mmlu_pl`	Polish
`mmlu_pt`	Portuguese (BR)
`mmlu_ro`	Romanian
`mmlu_ru`	Russian
`mmlu_si`	Sinhala
`mmlu_sn`	Shona
`mmlu_so`	Somali
`mmlu_sr`	Serbian
`mmlu_sv`	Swedish
`mmlu_sw`	Swahili
`mmlu_te`	Telugu
`mmlu_tr`	Turkish
`mmlu_uk`	Ukrainian
`mmlu_vi`	Vietnamese
`mmlu_yo`	Yoruba

Configuration

{
  "type": "mmlu_am",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "top_p": 0.00001,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5
    }
  }
}

Data Format

Result

{
  "tasks": {
    "stem": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.1,
              "stats": {
                "stderr": 0.0428571428571428
              }
            }
          }
        }
      }
    }
  }
}

Math & Reasoning (Math Test 500)#

Configuration

{
  "type": "math_test_500",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "judge": {
        "model": {
          "api_endpoint": {
            "model_id": "meta/llama-3.3-70b-instruct",
            "url": "https://<nim-base-url>/v1/chat/completions"
          }
        }
      }
    }
  }
}

The judge model should be at least 70B parameters, otherwise metrics evaluation might fail due to judge output not matching the specified metric template. Visit Troubleshooting Unsupported Judge Model for more details.

Data Format

Result

{
  "tasks": {
    "gpqa_diamond_cot_zeroshot": {
      "metrics": {
        "exact_match__flexible-extract": {
          "scores": {
            "exact_match__flexible-extract": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Benchmarks Using NeMo Alignment Template (Math Test 500 - NeMo)#

Configuration

{
    "type": "math_test_500_nemo",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5
        }
    }
}

Data Format

Result

{
  "tasks": {
    "math_test_500_nemo": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.32,
              "stats": {
                "stddev": 0.466476151587624,
                "stderr": 0.0666394502268034
              }
            }
          }
        }
      }
    }
  }
}

Parameters#

Request Parameters#

These parameters control how requests are made to the model:

Name	Description	Type	Default	Valid Values
`max_retries`	Maximum number of retries for failed requests.	Integer	Container default	—
`parallelism`	Number of parallel requests to improve throughput.	Integer	Container default	—
`request_timeout`	Timeout in seconds for each request.	Integer	Container default	—
`limit_samples`	Limit the number of samples to evaluate. Useful for testing.	Integer	`null` (all samples)	—

Model Parameters#

These parameters control the model’s generation behavior:

Name	Description	Type	Default	Valid Values
`temperature`	Sampling temperature for generation.	Float	Container default	`0.0–2.0`
`top_p`	Nucleus sampling parameter.	Float	0.00001	`0.01–1.0`
`max_new_tokens`	Maximum number of output sequence tokens.	Integer	4096	—

Extra Parameters#

Set these parameters in the params.extra section:

Name	Description	Type	Default	Valid Values
`model_type`	Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess if not provided.	String	Auto-detected	`"chat"` for all Simple-Evals benchmarks
`hf_token`	HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace.	String	—	Valid HF token
`num_fewshot`	Number of examples in few-shot context.	Integer	Task-dependent	—
`downsampling_ratio`	Ratio for downsampling the dataset.	Float	`null` (no downsampling)	`0.0–1.0`

Extra Judge Parameters#

Set these parameters in the params.extra.judge.model section:

Name	Description	Type	Default	Valid Values
`api_endpoint.url`	URL of judge model.	String	-	-
`api_endpoint.model_id`	ID of judge model.	String	-	-
`api_endpoint.api_key`	API Key to authenticate with judge model	String	-	-
`api_endpoint.format`	`"openai"` for OpenAI compatible judges; `"nim"` for direct calls via aiohttp	String	`"nim"`	`"openai"`, `"nim"`
`prompt.inference_params.temperature`	Sampling temperature for generation.	Float	0.0	-
`prompt.inference_params.top_p`	Nucleus sampling parameter.	Float	0.00001	-
`prompt.inference_params.max_tokens`	Maximum number of output sequence tokens.	Integer	1024	-