Simple Evaluations#

Simple Evaluations is a collection of benchmark evaluation types for language models, including various math benchmarks, GPQA variants, and MMLU in over 30 languages.

Prerequisites#

Target Configuration

All of the Simple-Evals evaluations require a chat endpoint configuration.

{
  "target": {
    "type": "model", 
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Supported Tasks#

Simple Evaluation Tasks by Category#

| Category | Example Task(s) | Description |
| --- | --- | --- |
| Advanced Reasoning | gpqa_diamond, gpqa_experts, gpqa_extended, gpqa_main | Graduate-level Q&A tasks. |
| Language Understanding in Multiple Languages | mmlu_am, mmlu_ar, …, mmlu_yo (see below for full list) | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | AIME_2025, AIME_2024, AA_AIME_2024, AA_math_test_500, math_test_500, simpleqa | Math word problems. SimpleQA has shorter fact-seeking questions. |
| Benchmarks Using NeMo's Alignment Template | aime_2025_nemo, aime_2024_nemo, math_test_500_nemo, gpqa_diamond_nemo | Same benchmarks as above but using NeMo's alignment template format. |


Advanced Reasoning (GPQA)#

Simple-Evals includes several GPQA variants. For all of these, include the extra parameter hf_token in order to access the gated dataset Idavidrein/gpqa.
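
Example configuration: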

{
    "type": "gpqa_diamond",
    "name": "simple-gpqa_diamond",
    "namespace": "my-organization",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5,
            "hf_token": "hf_XXXXXX"
        }
    }
}
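
An individual sample record pairs the question, answer choices, and reference answer with the model's output:
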
{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}
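
Example results:
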
{
  "tasks": {
    "gpqa_diamond": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.24,
              "stats": {
                "stddev": 0.427083130081253,
                "stderr": 0.0610118757258932
              }
            }
          }
        }
      }
    }
  }
}

Language Understanding (MMLU)#

Simple-Evals offers the MMLU benchmark in a variety of languages.

MMLU Tasks by Language#

| Task | Language |
| --- | --- |
| mmlu_am | Amharic |
| mmlu_ar | Arabic |
| mmlu_bn | Bengali |
| mmlu_cs | Czech |
| mmlu_de | German |
| mmlu_el | Greek |
| mmlu_en | English (US) |
| mmlu_es | Spanish (LA) |
| mmlu_fa | Farsi |
| mmlu_fil | Filipino |
| mmlu_fr | French |
| mmlu_ha | Hausa |
| mmlu_he | Hebrew |
| mmlu_hi | Hindi |
| mmlu_id | Bahasa Indonesia |
| mmlu_ig | Igbo |
| mmlu_it | Italian |
| mmlu_ja | Japanese |
| mmlu_ko | Korean |
| mmlu_ky | Kyrgyz |
| mmlu_lt | Lithuanian |
| mmlu_mg | Malagasy |
| mmlu_ms | Malay |
| mmlu_ne | Nepali |
| mmlu_nl | Dutch |
| mmlu_ny | Chichewa (also known as Chewa or Nyanja) |
| mmlu_pl | Polish |
| mmlu_pt | Portuguese (BR) |
| mmlu_ro | Romanian |
| mmlu_ru | Russian |
| mmlu_si | Sinhala |
| mmlu_sn | Shona |
| mmlu_so | Somali |
| mmlu_sr | Serbian |
| mmlu_sv | Swedish |
| mmlu_sw | Swahili |
| mmlu_te | Telugu |
| mmlu_tr | Turkish |
| mmlu_uk | Ukrainian |
| mmlu_vi | Vietnamese |
| mmlu_yo | Yoruba |
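
Example configuration (shown for mmlu_am):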

{
  "type": "mmlu_am",
  "name": "my-configuration-mmlu_am",
  "namespace": "my-organization",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "top_p": 0.00001,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5
    }
  }
}
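
Example results: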

{
  "tasks": {
    "stem": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.1,
              "stats": {
                "stderr": 0.0428571428571428
              }
            }
          }
        }
      }
    }
  }
}

Math & Reasoning (Math Test 500)#

{
  "type": "math_test_500",
  "name": "simple-math_test_500",
  "namespace": "my-organization",
  "params": {
    "limit_samples": 50,
    "parallelism": 50,
    "request_timeout": 300,
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "judge": {
        "model": {
          "api_endpoint": {
            "model_id": "meta/llama-3.2-1b-instruct",
            "url": "https://nim.int.aire.nvidia.com/v1/chat/completions"
          }
        }
      }
    }
  }
}
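
Example results: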

{
  "tasks": {
    "math_test_500": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Benchmarks Using NeMo Alignment Template (Math Test 500 - NeMo)#
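
Example configuration: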

{
    "type": "math_test_500_nemo",
    "name": "simple-math_test_500_nemo",
    "namespace": "my-organization",
    "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
            "model_type": "chat",
            "num_fewshot": 5
        }
    }
}
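
Example results: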

{
  "tasks": {
    "math_test_500_nemo": {
      "metrics": {
        "score": {
          "scores": {
            "micro": {
              "value": 0.32,
              "stats": {
                "stddev": 0.466476151587624,
                "stderr": 0.0666394502268034
              }
            }
          }
        }
      }
    }
  }
}

Parameters#

Request Parameters#

These parameters control how requests are made to the model:

| Name | Description | Type | Default | Valid Values |
| --- | --- | --- | --- | --- |
| max_retries | Maximum number of retries for failed requests. | Integer | Container default | - |
| parallelism | Number of parallel requests to improve throughput. | Integer | Container default | - |
| request_timeout | Timeout in seconds for each request. | Integer | Container default | - |
| limit_samples | Limit the number of samples to evaluate. Useful for testing. | Integer | null (all samples) | - |
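
As a sketch, the request parameters are set at the top level of the params object, as in the task configurations above; the values here are illustrative, not recommendations:

{
  "params": {
    "max_retries": 3,
    "parallelism": 50,
    "request_timeout": 300,
    "limit_samples": 50
  }
}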

Model Parameters#

These parameters control the model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
| --- | --- | --- | --- | --- |
| temperature | Sampling temperature for generation. | Float | Container default | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.00001 | 0.01–1.0 |
| max_new_tokens | Maximum number of output sequence tokens. | Integer | 4096 | - |
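
Assuming these generation settings sit alongside the request parameters at the top level of params (as top_p does in the mmlu_am example above), a sketch with illustrative values:

{
  "params": {
    "temperature": 0.0,
    "top_p": 0.00001,
    "max_new_tokens": 4096
  }
}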

Extra Parameters#

Set these parameters in the params.extra section:

| Name | Description | Type | Default | Valid Values |
| --- | --- | --- | --- | --- |
| model_type | Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess if not provided. | String | Auto-detected | "chat" for all Simple-Evals benchmarks |
| hf_token | HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace. | String | - | Valid HF token |
| num_fewshot | Number of examples in few-shot context. | Integer | Task-dependent | - |
| downsampling_ratio | Ratio for downsampling the dataset. | Float | null (no downsampling) | 0.0–1.0 |
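
For example, a params.extra section combining these settings might look like the following; the hf_token value is a placeholder, and the downsampling_ratio value is illustrative:

{
  "params": {
    "extra": {
      "model_type": "chat",
      "num_fewshot": 5,
      "hf_token": "hf_XXXXXX",
      "downsampling_ratio": 0.5
    }
  }
}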

Extra Judge Parameters#

Set these parameters in the params.extra.judge.model section:

| Name | Description | Type | Default | Valid Values |
| --- | --- | --- | --- | --- |
| api_endpoint.url | URL of the judge model. | String | - | - |
| api_endpoint.model_id | ID of the judge model. | String | - | - |
| api_endpoint.api_key | API key to authenticate with the judge model. | String | - | - |
| api_endpoint.format | "openai" for OpenAI-compatible judges; "nim" for direct calls via aiohttp. | String | "nim" | "openai", "nim" |
| prompt.inference_params.temperature | Sampling temperature for generation. | Float | 0.0 | - |
| prompt.inference_params.top_p | Nucleus sampling parameter. | Float | 0.00001 | - |
| prompt.inference_params.max_tokens | Maximum number of output sequence tokens. | Integer | 1024 | - |
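
A sketch of a judge configuration that sets these fields, following the parameter paths listed in the table above; the endpoint URL, model ID, and API key are placeholders:

{
  "params": {
    "extra": {
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "https://<judge-base-url>/v1/chat/completions",
            "model_id": "<judge-model-id>",
            "api_key": "<judge-api-key>",
            "format": "openai"
          },
          "prompt": {
            "inference_params": {
              "temperature": 0.0,
              "top_p": 0.00001,
              "max_tokens": 1024
            }
          }
        }
      }
    }
  }
}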