Simple Evaluations#
Simple Evaluations are a collection of benchmark evaluation types for language models, including various math benchmarks, GPQA variants, and MMLU in more than 40 languages.
Prerequisites#
Set up or select an existing evaluation target.
**Target Configuration**
All of the Simple-Evals evaluations require a chat endpoint configuration:
    {
      "target": {
        "type": "model",
        "model": {
          "api_endpoint": {
            "url": "https://<nim-base-url>/v1/chat/completions",
            "model_id": "meta/llama-3.3-70b-instruct"
          }
        }
      }
    }
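If you register the target programmatically rather than through a UI, a minimal sketch might look like the following. The `EVALUATOR_URL` placeholder and the `/v1/evaluation/targets` route are assumptions about a typical NeMo Evaluator deployment; check your service's API reference before relying on them.

```python
import requests

# Hypothetical placeholder -- substitute your Evaluator service URL.
EVALUATOR_URL = "http://<evaluator-base-url>"

target = {
    "type": "model",
    "name": "llama-3.3-70b-chat",   # illustrative target name
    "namespace": "my-organization",
    "model": {
        "api_endpoint": {
            "url": "https://<nim-base-url>/v1/chat/completions",
            "model_id": "meta/llama-3.3-70b-instruct",
        }
    },
}

# The /v1/evaluation/targets route is an assumption; consult your API docs.
response = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", json=target)
response.raise_for_status()
print(response.json())
```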
Supported Tasks#
| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | `gpqa_diamond` | Graduate-level Q&A tasks. |
| Language Understanding in Multiple Languages | `mmlu_am` | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | `math_test_500` | Math word problems. SimpleQA has shorter fact-seeking questions. |
| Benchmarks Using NeMo’s Alignment Template | `math_test_500_nemo` | The same benchmarks as above, but using NeMo’s alignment template format. |
Advanced Reasoning (GPQA)#
Simple-Evals includes several GPQA variants. For all of these, include the extra parameter `hf_token` to access the gated dataset `Idavidrein/gpqa`.
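For example, to configure the `gpqa_diamond` variant: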
    {
      "type": "gpqa_diamond",
      "name": "simple-gpqa_diamond",
      "namespace": "my-organization",
      "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
          "model_type": "chat",
          "num_fewshot": 5,
          "hf_token": "hf_XXXXXX"
        }
      }
    }
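Each evaluated sample pairs the benchmark question, the answer choices, and the reference answer with the model's output; for example: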
    {
      "question": "What is the capital of France?",
      "choices": ["Paris", "London", "Berlin", "Madrid"],
      "answer": "Paris",
      "output": "Paris"
    }
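Example results: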
    {
      "tasks": {
        "gpqa_diamond": {
          "metrics": {
            "score": {
              "scores": {
                "micro": {
                  "value": 0.24,
                  "stats": {
                    "stddev": 0.427083130081253,
                    "stderr": 0.0610118757258932
                  }
                }
              }
            }
          }
        }
      }
    }
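The `micro` score is the mean per-sample correctness. The reported `stderr` is consistent with `stddev / sqrt(n - 1)` for the 50-sample run above: 0.427083 / √49 ≈ 0.061012. A minimal sketch of that arithmetic, assuming binary per-sample scores (the 12-of-50 split is inferred from the 0.24 value and is not part of the output above):

```python
import math

# 12 correct answers out of 50 samples reproduces the 0.24 micro score above.
scores = [1] * 12 + [0] * 38

n = len(scores)
micro = sum(scores) / n  # 0.24

# Population standard deviation, matching the example's stddev (0.42708...).
stddev = math.sqrt(sum((s - micro) ** 2 for s in scores) / n)

# Standard error of the mean; stddev / sqrt(n - 1) here is equivalent to
# sample-stddev / sqrt(n) and matches the 0.06101... reported above.
stderr = stddev / math.sqrt(n - 1)

print(micro, stddev, stderr)
```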
Language Understanding (MMLU)#
Simple-Evals offers the MMLU benchmark in a variety of languages.
Each language is a separate task type. The configuration example below uses `mmlu_am` for Amharic; the task names for the other languages follow the same `mmlu_<language code>` pattern. Supported languages:

Amharic, Arabic, Bengali, Czech, German, Greek, English (US), Spanish (LA), Farsi, Filipino, French, Hausa, Hebrew, Hindi, Bahasa Indonesia, Igbo, Italian, Japanese, Korean, Kyrgyz, Lithuanian, Malagasy, Malay, Nepali, Dutch, Chichewa (also known as Chewa or Nyanja), Polish, Portuguese (BR), Romanian, Russian, Sinhala, Shona, Somali, Serbian, Swedish, Swahili, Telugu, Turkish, Ukrainian, Vietnamese, and Yoruba.
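For example, to configure the Amharic variant: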
    {
      "type": "mmlu_am",
      "name": "my-configuration-mmlu_am",
      "namespace": "my-organization",
      "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "top_p": 0.00001,
        "extra": {
          "model_type": "chat",
          "num_fewshot": 5
        }
      }
    }
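Example results; MMLU scores are keyed by subject category (here, `stem`):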
    {
      "tasks": {
        "stem": {
          "metrics": {
            "score": {
              "scores": {
                "micro": {
                  "value": 0.1,
                  "stats": {
                    "stderr": 0.0428571428571428
                  }
                }
              }
            }
          }
        }
      }
    }
Math & Reasoning (Math Test 500)#
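The `math_test_500` task uses an LLM judge to grade model answers; configure the judge endpoint under `params.extra.judge`: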
    {
      "type": "math_test_500",
      "name": "simple-math_test_500",
      "namespace": "my-organization",
      "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
          "model_type": "chat",
          "num_fewshot": 5,
          "judge": {
            "model": {
              "api_endpoint": {
                "model_id": "meta/llama-3.2-1b-instruct",
                "url": "https://<nim-base-url>/v1/chat/completions"
              }
            }
          }
        }
      }
    }
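Example results: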
    {
      "tasks": {
        "math_test_500": {
          "metrics": {
            "exact_match__flexible-extract": {
              "scores": {
                "exact_match__flexible-extract": {
                  "value": 1.0
                }
              }
            }
          }
        }
      }
    }
Benchmarks Using NeMo Alignment Template (Math Test 500 - NeMo)#
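Example configuration: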
    {
      "type": "math_test_500_nemo",
      "name": "simple-math_test_500_nemo",
      "namespace": "my-organization",
      "params": {
        "limit_samples": 50,
        "parallelism": 50,
        "request_timeout": 300,
        "extra": {
          "model_type": "chat",
          "num_fewshot": 5
        }
      }
    }
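Example results: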
    {
      "tasks": {
        "math_test_500_nemo": {
          "metrics": {
            "score": {
              "scores": {
                "micro": {
                  "value": 0.32,
                  "stats": {
                    "stddev": 0.466476151587624,
                    "stderr": 0.0666394502268034
                  }
                }
              }
            }
          }
        }
      }
    }
Parameters#
Request Parameters#
These parameters control how requests are made to the model:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `max_retries` | Maximum number of retries for failed requests. | Integer | Container default | — |
| `parallelism` | Number of parallel requests to improve throughput. | Integer | Container default | — |
| `request_timeout` | Timeout in seconds for each request. | Integer | Container default | — |
| `limit_samples` | Limit the number of samples to evaluate. Useful for testing. | Integer | None (all samples) | — |
Model Parameters#
These parameters control the model’s generation behavior:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `temperature` | Sampling temperature for generation. | Float | Container default | — |
| `top_p` | Nucleus sampling parameter. | Float | 0.00001 | 0.0 to 1.0 |
| `max_new_tokens` | Maximum number of output sequence tokens. | Integer | 4096 | — |
Extra Parameters#
Set these parameters in the `params.extra` section:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `model_type` | Type of model interface to use. Required by the underlying container, but Evaluator attempts to infer it if not provided. | String | Auto-detected | `chat`, `completions` |
| `hf_token` | Hugging Face token for accessing datasets and tokenizers. Required for tasks that fetch from Hugging Face. | String | — | Valid HF token |
| `num_fewshot` | Number of examples in the few-shot context. | Integer | Task-dependent | — |
| `downsampling_ratio` | Ratio for downsampling the dataset. | Float | — | 0.0 to 1.0 |
Extra Judge Parameters#
Set these parameters in the `params.extra.judge.model` section:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `url` | URL of the judge model endpoint. | String | — | — |
| `model_id` | ID of the judge model. | String | — | — |
| `api_key` | API key to authenticate with the judge model. | String | — | — |
| `temperature` | Sampling temperature for generation. | Float | 0.0 | — |
| `top_p` | Nucleus sampling parameter. | Float | 0.00001 | — |
| `max_new_tokens` | Maximum number of output sequence tokens. | Integer | 1024 | — |
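As a consolidated illustration of the parameter tables above, the sketch below assembles one configuration that uses every parameter group. Names that do not appear in the JSON examples earlier on this page (`max_retries`, `max_new_tokens`, `downsampling_ratio`, `api_key`) are taken from the tables' descriptions and are assumptions to verify against your Evaluator version; all values are illustrative.

```python
# Hedged sketch: one evaluation config combining the parameter groups above.
# Names absent from the earlier JSON examples (max_retries, max_new_tokens,
# downsampling_ratio, api_key) are assumptions based on the tables.
config = {
    "type": "math_test_500",
    "name": "simple-math_test_500-full",
    "namespace": "my-organization",
    "params": {
        # Request parameters
        "max_retries": 3,
        "parallelism": 50,
        "request_timeout": 300,
        "limit_samples": 50,
        # Model parameters
        "temperature": 0.0,
        "top_p": 0.00001,
        "max_new_tokens": 4096,
        "extra": {
            # Extra parameters
            "model_type": "chat",
            "num_fewshot": 5,
            "downsampling_ratio": 0.5,
            # Judge parameters (params.extra.judge.model)
            "judge": {
                "model": {
                    "api_endpoint": {
                        "url": "https://<judge-nim-url>/v1/chat/completions",
                        "model_id": "meta/llama-3.2-1b-instruct",
                        "api_key": "<your-api-key>",
                    }
                }
            },
        },
    },
}
```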