simple_evals#

This page contains all evaluation tasks for the simple_evals harness.

Task | Description
--- | ---
AA_AIME_2024 | AIME 2024 questions, math, using Artificial Analysis’s setup.
AA_math_test_500 | OpenAI math test 500, using Artificial Analysis’s setup.
AIME_2024 | AIME 2024 questions, math
AIME_2025 | AIME 2025 questions, math
AIME_2025_aa_v2 | AIME 2025 questions, math - params aligned with Artificial Analysis Index v2
aime_2024_nemo | AIME 2024 questions, math, using NeMo’s alignment template
aime_2025_nemo | AIME 2025 questions, math, using NeMo’s alignment template
browsecomp | BrowseComp is a benchmark for measuring the ability of agents to browse the web.
gpqa_diamond | gpqa_diamond 0-shot CoT
gpqa_diamond_aa_v2 | gpqa_diamond questions with custom regex extraction patterns for AA v2
gpqa_diamond_aa_v2_llama_4 | gpqa_diamond questions with custom regex extraction patterns for Llama 4
gpqa_diamond_aa_v3 | GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing
gpqa_diamond_nemo | gpqa_diamond questions, reasoning, using NeMo’s alignment template
gpqa_extended | gpqa_extended 0-shot CoT
gpqa_main | gpqa_main 0-shot CoT
healthbench | HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare.
healthbench_consensus | HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians.
healthbench_hard | HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models.
humaneval | HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
humanevalplus | HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
math_test_500 | OpenAI math test 500
math_test_500_nemo | math_test_500 questions, math, using NeMo’s alignment template
mgsm | MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages.
mgsm_aa_v2 | MGSM is a benchmark of grade-school math problems - params aligned with Artificial Analysis Index v2
mmlu | MMLU 0-shot CoT
mmlu_am | Global-MMLU 0-shot CoT in Amharic (am)
mmlu_ar | Global-MMLU 0-shot CoT in Arabic (ar)
mmlu_ar-lite | Global-MMLU-Lite 0-shot CoT in Arabic (ar)
mmlu_bn | Global-MMLU 0-shot CoT in Bengali (bn)
mmlu_bn-lite | Global-MMLU-Lite 0-shot CoT in Bengali (bn)
mmlu_cs | Global-MMLU 0-shot CoT in Czech (cs)
mmlu_de | Global-MMLU 0-shot CoT in German (de)
mmlu_de-lite | Global-MMLU-Lite 0-shot CoT in German (de)
mmlu_el | Global-MMLU 0-shot CoT in Greek (el)
mmlu_en | Global-MMLU 0-shot CoT in English (en)
mmlu_en-lite | Global-MMLU-Lite 0-shot CoT in English (en)
mmlu_es | Global-MMLU 0-shot CoT in Spanish (es)
mmlu_es-lite | Global-MMLU-Lite 0-shot CoT in Spanish (es)
mmlu_fa | Global-MMLU 0-shot CoT in Persian (fa)
mmlu_fil | Global-MMLU 0-shot CoT in Filipino (fil)
mmlu_fr | Global-MMLU 0-shot CoT in French (fr)
mmlu_fr-lite | Global-MMLU-Lite 0-shot CoT in French (fr)
mmlu_ha | Global-MMLU 0-shot CoT in Hausa (ha)
mmlu_he | Global-MMLU 0-shot CoT in Hebrew (he)
mmlu_hi | Global-MMLU 0-shot CoT in Hindi (hi)
mmlu_hi-lite | Global-MMLU-Lite 0-shot CoT in Hindi (hi)
mmlu_id | Global-MMLU 0-shot CoT in Indonesian (id)
mmlu_id-lite | Global-MMLU-Lite 0-shot CoT in Indonesian (id)
mmlu_ig | Global-MMLU 0-shot CoT in Igbo (ig)
mmlu_it | Global-MMLU 0-shot CoT in Italian (it)
mmlu_it-lite | Global-MMLU-Lite 0-shot CoT in Italian (it)
mmlu_ja | Global-MMLU 0-shot CoT in Japanese (ja)
mmlu_ja-lite | Global-MMLU-Lite 0-shot CoT in Japanese (ja)
mmlu_ko | Global-MMLU 0-shot CoT in Korean (ko)
mmlu_ko-lite | Global-MMLU-Lite 0-shot CoT in Korean (ko)
mmlu_ky | Global-MMLU 0-shot CoT in Kyrgyz (ky)
mmlu_llama_4 | MMLU questions with custom regex extraction patterns for Llama 4
mmlu_lt | Global-MMLU 0-shot CoT in Lithuanian (lt)
mmlu_mg | Global-MMLU 0-shot CoT in Malagasy (mg)
mmlu_ms | Global-MMLU 0-shot CoT in Malay (ms)
mmlu_my-lite | Global-MMLU-Lite 0-shot CoT in Malay (my)
mmlu_ne | Global-MMLU 0-shot CoT in Nepali (ne)
mmlu_nl | Global-MMLU 0-shot CoT in Dutch (nl)
mmlu_ny | Global-MMLU 0-shot CoT in Nyanja (ny)
mmlu_pl | Global-MMLU 0-shot CoT in Polish (pl)
mmlu_pro | The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models’ capabilities. It contains 12K complex questions across various disciplines.
mmlu_pro_aa_v2 | MMLU-Pro - params aligned with Artificial Analysis Index v2
mmlu_pro_aa_v3 | MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options
mmlu_pro_llama_4 | MMLU-Pro questions with custom regex extraction patterns for Llama 4
mmlu_pt | Global-MMLU 0-shot CoT in Portuguese (pt)
mmlu_pt-lite | Global-MMLU-Lite 0-shot CoT in Portuguese (pt)
mmlu_ro | Global-MMLU 0-shot CoT in Romanian (ro)
mmlu_ru | Global-MMLU 0-shot CoT in Russian (ru)
mmlu_si | Global-MMLU 0-shot CoT in Sinhala (si)
mmlu_sn | Global-MMLU 0-shot CoT in Shona (sn)
mmlu_so | Global-MMLU 0-shot CoT in Somali (so)
mmlu_sr | Global-MMLU 0-shot CoT in Serbian (sr)
mmlu_sv | Global-MMLU 0-shot CoT in Swedish (sv)
mmlu_sw | Global-MMLU 0-shot CoT in Swahili (sw)
mmlu_sw-lite | Global-MMLU-Lite 0-shot CoT in Swahili (sw)
mmlu_te | Global-MMLU 0-shot CoT in Telugu (te)
mmlu_tr | Global-MMLU 0-shot CoT in Turkish (tr)
mmlu_uk | Global-MMLU 0-shot CoT in Ukrainian (uk)
mmlu_vi | Global-MMLU 0-shot CoT in Vietnamese (vi)
mmlu_yo | Global-MMLU 0-shot CoT in Yoruba (yo)
mmlu_yo-lite | Global-MMLU-Lite 0-shot CoT in Yoruba (yo)
mmlu_zh-lite | Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh)
simpleqa | A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.

AA_AIME_2024#

AIME 2024 questions, math, using Artificial Analysis’s setup.

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AA_AIME_2024

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AA_AIME_2024
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AA_AIME_2024
target:
  api_endpoint: {}
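
For reference, the command template above expands the task’s config params into a plain simple_evals invocation. The Python sketch below is only an illustration of that mapping, using the AA_AIME_2024 defaults shown above; the model ID, endpoint URL, and output directory are hypothetical placeholders, not part of the task definition.

# Illustrative only: mirrors how the command template maps config params
# to simple_evals CLI flags. The model ID, URL, and output directory are
# hypothetical placeholders.
params = {
    "task": "AA_AIME_2024",
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "max_new_tokens": 16384,
    "parallelism": 10,
    "max_retries": 5,
    "request_timeout": 60,
    "n_samples": 10,
}

cmd = [
    "simple_evals",
    "--model", "my-model",                                  # target.api_endpoint.model_id (placeholder)
    "--eval_name", params["task"],
    "--url", "http://localhost:8000/v1/chat/completions",   # target.api_endpoint.url (placeholder)
    "--temperature", str(params["temperature"]),
    "--top_p", str(params["top_p"]),
    "--max_tokens", str(params["max_new_tokens"]),
    "--out_dir", "results/AA_AIME_2024",                    # {{config.output_dir}}/{{config.type}} (placeholder)
    "--cache_dir", "results/AA_AIME_2024/cache",
    "--num_threads", str(params["parallelism"]),
    "--max_retries", str(params["max_retries"]),
    "--timeout", str(params["request_timeout"]),
    "--num_repeats", str(params["n_samples"]),              # from extra.n_samples
]
print(" ".join(cmd))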

AA_math_test_500#

OpenAI math test 500, using Artificial Analysis’s setup.

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AA_math_test_500

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AA_math_test_500
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AA_math_test_500
target:
  api_endpoint: {}
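
When extra.custom_config is set, the command template first writes it out as temp_config.json and then converts it to custom_config.yml with an inline python3 -c call before passing it via --custom_eval_cfg_file. Expanded for readability, that conversion step is essentially the following (output_dir stands in for {{config.output_dir}}):

import json
import yaml

output_dir = "results"  # placeholder for {{config.output_dir}}

# Same logic as the inline python3 -c snippet in the command template:
# read the JSON dump of extra.custom_config and rewrite it as YAML.
with open(f"{output_dir}/temp_config.json") as f:
    config_data = json.load(f)

with open(f"{output_dir}/custom_config.yml", "w") as f:
    yaml.dump(config_data, f, default_flow_style=False)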

AIME_2024#

AIME 2024 questions, math

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AIME_2024

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AIME_2024
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2024
target:
  api_endpoint: {}

AIME_2025#

AIME 2025 questions, math

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AIME_2025

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AIME_2025
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2025
target:
  api_endpoint: {}

AIME_2025_aa_v2#

AIME 2025 questions, math - params aligned with Artificial Analysis Index v2

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AIME_2025_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: AIME_2025
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2025_aa_v2
target:
  api_endpoint: {}

aime_2024_nemo#

AIME 2024 questions, math, using NeMo’s alignment template

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: aime_2024_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: aime_2024_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: aime_2024_nemo
target:
  api_endpoint: {}

aime_2025_nemo#

AIME 2025 questions, math, using NeMo’s alignment template

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: aime_2025_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: aime_2025_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: aime_2025_nemo
target:
  api_endpoint: {}

browsecomp#

BrowseComp is a benchmark for measuring the ability of agents to browse the web.

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: browsecomp

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: browsecomp
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: browsecomp
target:
  api_endpoint: {}
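
browsecomp is scored with an LLM judge; its defaults above set judge.api_key to JUDGE_API_KEY and judge.backend to openai while leaving judge.url and judge.model_id null. In the command template, each judge flag is wrapped in an is-not-none check, so only non-null values are rendered. A minimal sketch of that conditional rendering, assuming the defaults shown above:

# Mirrors the {% if ... is not none %} guards around the judge flags in the
# command template. Values are the browsecomp defaults; url and model_id stay
# None, so no --judge_url / --judge_model_id flags are emitted.
judge = {
    "url": None,
    "model_id": None,
    "api_key": "JUDGE_API_KEY",
    "backend": "openai",
    "request_timeout": 600,
    "max_retries": 16,
    "temperature": 0.0,
    "top_p": 0.0001,
    "max_tokens": 1024,
    "max_concurrent_requests": None,
}

flags = []
for key, value in judge.items():
    if value is not None:
        # api_key maps to --judge_api_key_name; the rest map to --judge_<key>.
        flag = "--judge_api_key_name" if key == "api_key" else f"--judge_{key}"
        flags.extend([flag, str(value)])
print(" ".join(flags))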

gpqa_diamond#

gpqa_diamond 0-shot CoT

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond
target:
  api_endpoint: {}

gpqa_diamond_aa_v2#

gpqa_diamond questions with custom regex extraction patterns for AA v2

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: aa_v2_regex
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v2
target:
  api_endpoint: {}
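
The aa_v2_regex entry above extracts the chosen letter from a response that ends with an "Answer: X" line (allowing markdown bold or underscore markers around it). A minimal sketch of applying that pattern; the sample response text is invented for illustration:

import re

# The aa_v2_regex pattern and match_group from the custom_config above.
PATTERN = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"

response = "The reaction proceeds with inversion of configuration.\n\n**Answer: C**"  # invented example
match = re.search(PATTERN, response)
if match:
    print(match.group(1))  # match_group 1 -> "C"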

gpqa_diamond_aa_v2_llama_4#

gpqa_diamond questions with custom regex extraction patterns for Llama 4

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_aa_v2_llama_4

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v2_llama_4
target:
  api_endpoint: {}
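
The Llama 4 variant lists two extraction patterns: the same "Answer: X" format, plus a fallback for responses phrased as "the best answer is X". A reasonable reading is that the patterns are tried in the order listed and the first match wins; the sketch below makes that assumption explicit, with invented sample responses:

import re

# Patterns from the custom_config above, in the order listed.
PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # answer_colon_llama4
    r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])",           # answer_is_llama4

]

def extract(response):
    # Assumption for illustration: the first matching pattern wins.
    for pattern in PATTERNS:
        m = re.search(pattern, response)
        if m:
            return m.group(1).upper()
    return None

print(extract("**Answer: B**"))           # -> B (first pattern)
print(extract("The best answer is D."))   # -> D (fallback pattern)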

gpqa_diamond_aa_v3#

GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_aa_v3

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following
          format: ''Answer: A/B/C/D'' (e.g. ''Answer: A'').


          {Question}


          A) {A}

          B) {B}

          C) {C}

          D) {D}

          '
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: primary_answer_format
        - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
          match_group: 1
          name: latex_boxed
        - regex: answer is ([a-zA-Z])
          match_group: 1
          name: natural_language
        - regex: answer is \(([a-zA-Z])\)
          match_group: 1
          name: with_parenthesis
        - regex: ([A-Z])\)\s*[^A-Z]*
          match_group: 1
          name: choice_format
        - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
          match_group: 1
          name: explicit_statement
        - regex: ([A-Z])\s*$
          match_group: 1
          name: standalone_letter_end
        - regex: ([A-Z])\s*\.
          match_group: 1
          name: letter_with_period
        - regex: ([A-Z])\s*[^\w]
          match_group: 1
          name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v3
target:
  api_endpoint: {}
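
The aa_v3 custom_config pairs an explicit prompt template with a cascade of nine extraction patterns, from the strict "Answer: X" line down to progressively looser fallbacks such as a standalone trailing letter. The sketch below shows how the template might be filled and the cascade applied; the question content and the first-match-wins ordering are assumptions for illustration, and the template’s whitespace is condensed:

import re

# Prompt template from the custom_config above (whitespace condensed).
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be in the following format: 'Answer: A/B/C/D' "
    "(e.g. 'Answer: A').\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}\n"
)

# Extraction patterns from the custom_config above, strictest first.
PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # primary_answer_format
    r"\\boxed\{[^}]*([A-Z])[^}]*\}",                                                # latex_boxed
    r"answer is ([a-zA-Z])",                                                        # natural_language
    r"answer is \(([a-zA-Z])\)",                                                    # with_parenthesis
    r"([A-Z])\)\s*[^A-Z]*",                                                         # choice_format
    r"([A-Z])\s+is\s+the\s+correct\s+answer",                                       # explicit_statement
    r"([A-Z])\s*$",                                                                 # standalone_letter_end
    r"([A-Z])\s*\.",                                                                # letter_with_period
    r"([A-Z])\s*[^\w]",                                                             # letter_nonword
]

def extract_answer(response):
    # Assumption for illustration: patterns are tried in order, first match wins.
    for pattern in PATTERNS:
        m = re.search(pattern, response)
        if m:
            return m.group(1).upper()
    return None

prompt = PROMPT_TEMPLATE.format(
    Question="Which particle mediates the electromagnetic force?",  # invented example
    A="Gluon", B="Photon", C="W boson", D="Graviton",
)
print(extract_answer("The force carrier is the photon.\n\nAnswer: B"))  # -> B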

gpqa_diamond_nemo#

gpqa_diamond questions, reasoning, using NeMo’s alignment template

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_nemo
target:
  api_endpoint: {}
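
For orientation, the command template above might render roughly as follows for this task, assuming a hypothetical OpenAI-compatible endpoint at http://localhost:8000/v1/chat/completions, a placeholder model ID of my-model, and /results as the output directory. The remaining values are taken directly from the default params shown in the config; the judge flags render because the judge defaults (backend, timeouts, sampling) are non-null even though no judge endpoint is set.

# Illustrative rendering only; the endpoint URL, model ID, and output directory are placeholders.
simple_evals --model my-model \
  --eval_name gpqa_diamond_nemo \
  --url http://localhost:8000/v1/chat/completions \
  --temperature 0.0 --top_p 1e-05 --max_tokens 16384 \
  --out_dir /results/gpqa_diamond_nemo \
  --cache_dir /results/gpqa_diamond_nemo/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --num_repeats 5 \
  --judge_backend openai --judge_request_timeout 600 --judge_max_retries 16 \
  --judge_temperature 0.0 --judge_top_p 0.0001 --judge_max_tokens 1024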

gpqa_extended#

gpqa_extended 0-shot CoT

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_extended

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_extended
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_extended
target:
  api_endpoint: {}

gpqa_main#

gpqa_main 0-shot CoT

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_main

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_main
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_main
target:
  api_endpoint: {}
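
When config.params.extra.custom_config is set (it is null by default in these entries), the command template serializes it to JSON, converts it to YAML with an inline python3 -c call, and passes the resulting file to simple_evals via --custom_eval_cfg_file. A minimal sketch of those rendered steps, assuming a purely hypothetical custom_config value and /results as the output directory:

# Hypothetical illustration; the custom_config contents and output directory are placeholders.
echo '{"example_key": "example_value"}' > /results/temp_config.json
python3 -c 'import yaml, json; config_data = json.load(open("/results/temp_config.json")); yaml.dump(config_data, open("/results/custom_config.yml", "w"), default_flow_style=False)'
# The generated file is then appended to the simple_evals invocation as:
#   --custom_eval_cfg_file /results/custom_config.yml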

healthbench#

HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: healthbench

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench
target:
  api_endpoint: {}

healthbench_consensus#

HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: healthbench_consensus

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench_consensus
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench_consensus
target:
  api_endpoint: {}

healthbench_hard#

HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: healthbench_hard

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench_hard
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench_hard
target:
  api_endpoint: {}
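
The three HealthBench tasks are judge-based: their default configs set extra.judge.api_key to JUDGE_API_KEY while leaving the judge URL and model ID null for the user to supply. Assuming those two fields have been overridden, the judge portion of the rendered command might look like the sketch below; the judge endpoint and model ID are placeholders, and JUDGE_API_KEY must be exported in the environment.

# Placeholder judge endpoint and model ID.
export JUDGE_API_KEY=<your-judge-api-key>
# Judge flags appended to the simple_evals command at render time:
#   --judge_url https://judge.example.com/v1/chat/completions \
#   --judge_model_id example-judge-model \
#   --judge_api_key_name JUDGE_API_KEY \
#   --judge_backend openai --judge_request_timeout 600 --judge_max_retries 16 \
#   --judge_temperature 0.0 --judge_top_p 0.0001 --judge_max_tokens 1024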

humaneval#

HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: humaneval

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: humaneval
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: humaneval
target:
  api_endpoint: {}

humanevalplus#

HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: humanevalplus

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: humanevalplus
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: humanevalplus
target:
  api_endpoint: {}

math_test_500#

OpenAI math test 500

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: math_test_500

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: math_test_500
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: math_test_500
target:
  api_endpoint: {}

math_test_500_nemo#

math_test_500 questions, math, using NeMo’s alignment template

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: math_test_500_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: math_test_500_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: math_test_500_nemo
target:
  api_endpoint: {}

mgsm#

MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mgsm

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mgsm
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mgsm
target:
  api_endpoint: {}

mgsm_aa_v2#

MGSM is a benchmark of grade-school math problems - params aligned with Artificial Analysis Index v2

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mgsm_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mgsm
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mgsm_aa_v2
target:
  api_endpoint: {}

mmlu#

MMLU 0-shot CoT

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu
target:
  api_endpoint: {}

mmlu_am#

Global-MMLU 0-shot CoT in Amharic (am)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_am

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_am
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_am
target:
  api_endpoint: {}

mmlu_ar#

Global-MMLU 0-shot CoT in Arabic (ar)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ar

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ar
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ar
target:
  api_endpoint: {}
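
Rendered example (illustrative): with the mmlu_ar defaults above, the command template expands to a plain simple_evals invocation similar to the sketch below. The model ID, endpoint URL, output directory, and API key variable are placeholder assumptions, not values defined on this page.

export API_KEY=$MY_API_KEY
simple_evals --model my-org/my-chat-model \
  --eval_name mmlu_ar \
  --url https://example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
  --out_dir ./results/mmlu_ar --cache_dir ./results/mmlu_ar/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --num_repeats 1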

mmlu_ar-lite#

Global-MMLU-Lite 0-shot CoT in Arabic (ar)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ar-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ar-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ar-lite
target:
  api_endpoint: {}

mmlu_bn#

Global-MMLU 0-shot CoT in Bengali (bn)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_bn

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_bn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_bn
target:
  api_endpoint: {}

mmlu_bn-lite#

Global-MMLU-Lite 0-shot CoT in Bengali (bn)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_bn-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_bn-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_bn-lite
target:
  api_endpoint: {}

mmlu_cs#

Global-MMLU 0-shot CoT in Czech (cs)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_cs

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_cs
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_cs
target:
  api_endpoint: {}

mmlu_de#

Global-MMLU 0-shot CoT in German (de)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_de

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_de
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_de
target:
  api_endpoint: {}
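
Smoke-test example (illustrative): the template also exposes --first_n (from limit_samples) and --downsampling_ratio (from extra.downsampling_ratio) for running only part of a full Global-MMLU split. The endpoint, model ID, and the value of --first_n below are arbitrary assumptions for a quick check, not recommended settings.

simple_evals --model my-org/my-chat-model \
  --eval_name mmlu_de \
  --url https://example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
  --out_dir ./results/mmlu_de --cache_dir ./results/mmlu_de/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --first_n 100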

mmlu_de-lite#

Global-MMLU-Lite 0-shot CoT in German (de)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_de-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_de-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_de-lite
target:
  api_endpoint: {}

mmlu_el#

Global-MMLU 0-shot CoT in Greek (el)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_el

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_el
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_el
target:
  api_endpoint: {}

mmlu_en#

Global-MMLU 0-shot CoT in English (en)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_en

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_en
target:
  api_endpoint: {}
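
Container example (illustrative): one way to run this task is to execute the rendered command inside the listed container image. The mount path, model ID, and endpoint below are assumptions, and the exact entrypoint may differ depending on how your launcher invokes the container.

docker run --rm \
  -e API_KEY=$MY_API_KEY \
  -v "$PWD/results:/results" \
  nvcr.io/nvidia/eval-factory/simple-evals:26.01 \
  simple_evals --model my-org/my-chat-model \
    --eval_name mmlu_en \
    --url https://example.com/v1/chat/completions \
    --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
    --out_dir /results/mmlu_en --cache_dir /results/mmlu_en/cache \
    --num_threads 10 --max_retries 5 --timeout 60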

mmlu_en-lite#

Global-MMLU-Lite 0-shot CoT in English (en)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_en-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_en-lite
target:
  api_endpoint: {}

mmlu_es#

Global-MMLU 0-shot CoT in Spanish (es)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_es

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_es
target:
  api_endpoint: {}

mmlu_es-lite#

Global-MMLU-Lite 0-shot CoT in Spanish (es)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_es-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_es-lite
target:
  api_endpoint: {}
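
Custom config example (illustrative): when extra.custom_config is set, the template first writes it out as JSON, converts it to YAML with the embedded python3 one-liner, and then passes the file via --custom_eval_cfg_file. The sketch below shows that rendered preamble; the config key and value are placeholders, since the custom-config schema is not documented on this page, and the model ID and endpoint are likewise assumptions.

echo '{"example_option": "example_value"}' > ./results/temp_config.json
python3 -c 'import yaml, json; config_data = json.load(open("./results/temp_config.json")); yaml.dump(config_data, open("./results/custom_config.yml", "w"), default_flow_style=False)'
simple_evals --model my-org/my-chat-model --eval_name mmlu_es-lite \
  --url https://example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
  --out_dir ./results/mmlu_es-lite --cache_dir ./results/mmlu_es-lite/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --custom_eval_cfg_file ./results/custom_config.yml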

mmlu_fa#

Global-MMLU 0-shot CoT in Persian (fa)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fa

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fa
target:
  api_endpoint: {}

mmlu_fil#

Global-MMLU 0-shot CoT in Filipino (fil)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fil

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fil
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fil
target:
  api_endpoint: {}

mmlu_fr#

Global-MMLU 0-shot CoT in French (fr)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fr

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fr
target:
  api_endpoint: {}

mmlu_fr-lite#

Global-MMLU-Lite 0-shot CoT in French (fr)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fr-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fr-lite
target:
  api_endpoint: {}
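
The launch command shown for each task is a Jinja2 template that the harness fills in from the target endpoint and the config block. As a minimal sketch of how the rendering could come out for mmlu_fr-lite with the defaults listed above, the Python snippet below assembles the core flags; the model ID, endpoint URL, and output directory are hypothetical placeholders, not values defined on this page.

# Sketch only: build the core simple_evals invocation from the mmlu_fr-lite defaults.
# MODEL_ID, URL, and OUT_DIR are assumptions standing in for the template placeholders.
params = {
    "task": "mmlu_fr-lite",
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "max_new_tokens": 16384,
    "parallelism": 10,
    "max_retries": 5,
    "request_timeout": 60,
    "n_samples": 1,
}
MODEL_ID = "my-org/my-model"           # hypothetical target.api_endpoint.model_id
URL = "http://localhost:8000/v1"       # hypothetical target.api_endpoint.url
OUT_DIR = "/results/mmlu_fr-lite"      # hypothetical config.output_dir/config.type

command = (
    f"simple_evals --model {MODEL_ID} --eval_name {params['task']} --url {URL} "
    f"--temperature {params['temperature']} --top_p {params['top_p']} "
    f"--max_tokens {params['max_new_tokens']} --out_dir {OUT_DIR} "
    f"--cache_dir {OUT_DIR}/cache --num_threads {params['parallelism']} "
    f"--max_retries {params['max_retries']} --timeout {params['request_timeout']} "
    f"--num_repeats {params['n_samples']}"
)
print(command)

Optional flags such as --first_n, --add_system_prompt, --downsampling_ratio, and the --judge_* group are guarded by {% if %} blocks in the template and are appended only when the corresponding fields are set, which is why they are absent with the null/false defaults above.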

mmlu_ha#

Global-MMLU 0-shot CoT in Hausa (ha)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ha

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ha
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ha
target:
  api_endpoint: {}
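
When extra.custom_config is not null, the template first writes it to temp_config.json and converts it to custom_config.yml through the inline python3 -c command before launching simple_evals. The following is a readable sketch of that same conversion step; OUTPUT_DIR is an assumption standing in for the {{config.output_dir}} placeholder.

# Sketch of the JSON-to-YAML conversion embedded in the launch template above.
# Requires PyYAML; OUTPUT_DIR is a hypothetical stand-in for {{config.output_dir}}.
import json
import yaml

OUTPUT_DIR = "/results/mmlu_ha"  # assumption

with open(f"{OUTPUT_DIR}/temp_config.json") as f:
    config_data = json.load(f)

with open(f"{OUTPUT_DIR}/custom_config.yml", "w") as f:
    yaml.dump(config_data, f, default_flow_style=False)

The resulting custom_config.yml is then passed to simple_evals via --custom_eval_cfg_file, as the final conditional in the template shows.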

mmlu_he#

Global-MMLU 0-shot CoT in Hebrew (he)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_he

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_he
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_he
target:
  api_endpoint: {}

mmlu_hi#

Global-MMLU 0-shot CoT in Hindi (hi)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_hi

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_hi
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_hi
target:
  api_endpoint: {}
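
All of these tasks ship with the judge block's url, model_id, and api_key set to null, so none of the --judge_* flags are emitted by default. Each non-null judge field maps to one flag in the template; the sketch below mirrors those guards with a hypothetical judge endpoint filled in for illustration.

# Sketch: how non-null judge.* fields map to the optional --judge_* flags.
# The endpoint values are assumptions; the flag names come from the template above.
judge = {
    "url": "http://judge-host:8000/v1",   # hypothetical judge endpoint
    "model_id": "judge-model",            # hypothetical judge model
    "api_key": None,
    "backend": "openai",
    "request_timeout": 600,
    "max_retries": 16,
    "temperature": 0.0,
    "top_p": 0.0001,
    "max_tokens": 1024,
    "max_concurrent_requests": None,
}
flags = []
for key, value in judge.items():
    if value is not None:  # mirrors the `is not none` guards in the template
        flag = "--judge_api_key_name" if key == "api_key" else f"--judge_{key}"
        flags.append(f"{flag} {value}")
print(" ".join(flags))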

mmlu_hi-lite#

Global-MMLU-Lite 0-shot CoT in Hindi (hi)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_hi-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_hi-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_hi-lite
target:
  api_endpoint: {}

mmlu_id#

Global-MMLU 0-shot CoT in Indonesian (id)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_id

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_id
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_id
target:
  api_endpoint: {}

mmlu_id-lite#

Global-MMLU-Lite 0-shot CoT in Indonesian (id)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_id-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_id-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_id-lite
target:
  api_endpoint: {}

mmlu_ig#

Global-MMLU 0-shot CoT in Igbo (ig)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ig

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ig
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ig
target:
  api_endpoint: {}
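
Every task in this family lists chat as its only supported endpoint type, and the default judge backend is openai. As a hedged illustration only (the request shape is not defined on this page), a chat-type endpoint is generally expected to accept an OpenAI-style chat completions payload along these lines:

# Assumed shape of a request to a chat-type endpoint; URL and model name are placeholders.
import json
import urllib.request

payload = {
    "model": "my-org/my-model",  # hypothetical model ID
    "messages": [{"role": "user", "content": "Answer with a single letter: A or B?"}],
    "temperature": 0.0,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # hypothetical endpoint URL
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # left commented out; no live endpoint is assumed here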

mmlu_it#

Global-MMLU 0-shot CoT in Italian (it)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_it

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_it
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_it
target:
  api_endpoint: {}

mmlu_it-lite#

Global-MMLU-Lite 0-shot CoT in Italian (it)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_it-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_it-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_it-lite
target:
  api_endpoint: {}

mmlu_ja#

Global-MMLU 0-shot CoT in Japanese (ja)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ja

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ja
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ja
target:
  api_endpoint: {}

mmlu_ja-lite#

Global-MMLU-Lite 0-shot CoT in Japanese (ja)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ja-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ja-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ja-lite
target:
  api_endpoint: {}

mmlu_ko#

Global-MMLU 0-shot CoT in Korean (ko)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ko

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ko
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ko
target:
  api_endpoint: {}

mmlu_ko-lite#

Global-MMLU-Lite 0-shot CoT in Korean (ko)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ko-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ko-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ko-lite
target:
  api_endpoint: {}

mmlu_ky#

Global-MMLU 0-shot CoT in Kyrgyz (ky)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ky

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ky
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ky
target:
  api_endpoint: {}

mmlu_llama_4#

MMLU questions with custom regex extraction patterns for Llama 4

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_llama_4

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_llama_4
target:
  api_endpoint: {}
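
The two extraction patterns in the custom_config above are tried in order against the model's completion, and the first match supplies the graded letter (match_group: 1). A minimal Python sketch of that behavior, with the regexes copied verbatim from the config; the harness's actual extraction code may differ in detail:

import re

# Llama-4 extraction patterns, copied verbatim from the custom_config above,
# tried in the order they are listed.
LLAMA4_PATTERNS = [
    ("answer_colon_llama4",
     r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"),
    ("answer_is_llama4",
     r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])"),
]

def extract_choice(completion):
    """Return the first captured answer letter (upper-cased), or None."""
    for _name, pattern in LLAMA4_PATTERNS:
        match = re.search(pattern, completion)
        if match:
            return match.group(1).upper()  # match_group: 1 in the config
    return None

print(extract_choice("Step-by-step reasoning... **Answer:** C"))  # C
print(extract_choice("So the best answer is B, final."))          # B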

mmlu_lt#

Global-MMLU 0-shot CoT in Lithuanian (lt)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_lt

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_lt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_lt
target:
  api_endpoint: {}
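
Every task on this page launches through the same command template; each --judge_* flag is appended only when the corresponding field under extra.judge is not null, so with the defaults above (url and model_id null, backend and the sampling limits set) the judge URL and model flags are simply omitted. A minimal jinja2 sketch of that conditional rendering, using a reduced subset of the template and purely illustrative judge values:

from jinja2 import Template

# Reduced subset of the launch template: a flag is emitted only when the
# corresponding judge field is not None.
snippet = (
    "{% if judge.url is not none %} --judge_url {{judge.url}}{% endif %}"
    "{% if judge.model_id is not none %} --judge_model_id {{judge.model_id}}{% endif %}"
    "{% if judge.backend is not none %} --judge_backend {{judge.backend}}{% endif %}"
    "{% if judge.max_tokens is not none %} --judge_max_tokens {{judge.max_tokens}}{% endif %}"
)

defaults = {"url": None, "model_id": None, "backend": "openai", "max_tokens": 1024}
# Placeholder judge endpoint and model id, not taken from this page.
overridden = {**defaults, "url": "https://judge.example/v1", "model_id": "judge-model"}

print(Template(snippet).render(judge=defaults))
#  --judge_backend openai --judge_max_tokens 1024
print(Template(snippet).render(judge=overridden))
#  --judge_url https://judge.example/v1 --judge_model_id judge-model --judge_backend openai --judge_max_tokens 1024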

mmlu_mg#

Global-MMLU 0-shot CoT in Malagasy (mg)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_mg

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_mg
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_mg
target:
  api_endpoint: {}

mmlu_ms#

Global-MMLU 0-shot CoT in Malay (ms)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ms

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ms
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ms
target:
  api_endpoint: {}

mmlu_my-lite#

Global-MMLU-Lite 0-shot CoT in Burmese (my)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_my-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_my-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_my-lite
target:
  api_endpoint: {}

mmlu_ne#

Global-MMLU 0-shot CoT in Nepali (ne)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ne

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ne
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ne
target:
  api_endpoint: {}

mmlu_nl#

Global-MMLU 0-shot CoT in Dutch (nl)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_nl

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_nl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_nl
target:
  api_endpoint: {}

mmlu_ny#

Global-MMLU 0-shot CoT in Nyanja (ny)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ny

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ny
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ny
target:
  api_endpoint: {}

mmlu_pl#

Global-MMLU 0-shot CoT in Polish (pl)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pl

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pl
target:
  api_endpoint: {}

mmlu_pro#

MMLU-Pro is a more robust and challenging massive multi-task understanding dataset, designed to benchmark large language models more rigorously than the original MMLU. It contains 12K complex questions across various disciplines.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro
target:
  api_endpoint: {}

mmlu_pro_aa_v2#

MMLU-Pro - params aligned with Artificial Analysis Index v2

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_aa_v2
target:
  api_endpoint: {}

mmlu_pro_aa_v3#

MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro_aa_v3

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following
          format: ''Answer: A/B/C/D/E/F/G/H/I/J'' (e.g. ''Answer: A'').


          {Question}


          A) {A}

          B) {B}

          C) {C}

          D) {D}

          E) {E}

          F) {F}

          G) {G}

          H) {H}

          I) {I}

          J) {J}

          '
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: primary_answer_format
        - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
          match_group: 1
          name: latex_boxed
        - regex: answer is ([a-zA-Z])
          match_group: 1
          name: natural_language
        - regex: answer is \(([a-zA-Z])\)
          match_group: 1
          name: with_parenthesis
        - regex: ([A-Z])\)\s*[^A-Z]*
          match_group: 1
          name: choice_format
        - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
          match_group: 1
          name: explicit_statement
        - regex: ([A-Z])\s*$
          match_group: 1
          name: standalone_letter_end
        - regex: ([A-Z])\s*\.
          match_group: 1
          name: letter_with_period
        - regex: ([A-Z])\s*[^\w]
          match_group: 1
          name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_aa_v3
target:
  api_endpoint: {}
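
The AA v3 custom_config pairs a fixed prompt_template (the question plus ten options A-J and an explicit "Answer: X" instruction) with a cascade of extraction regexes that runs from the most specific pattern (the "Answer:" line) down to increasingly permissive fallbacks (a \boxed{} letter, "answer is ...", and finally bare letters). A minimal Python sketch of that multi-stage extraction, with the patterns copied from the config above; the harness's own logic may differ:

import re

# Extraction cascade copied from the custom_config above, most specific first.
AA_V3_PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # primary_answer_format
    r"\\boxed\{[^}]*([A-Z])[^}]*\}",                                                # latex_boxed
    r"answer is ([a-zA-Z])",                                                        # natural_language
    r"answer is \(([a-zA-Z])\)",                                                    # with_parenthesis
    r"([A-Z])\)\s*[^A-Z]*",                                                         # choice_format
    r"([A-Z])\s+is\s+the\s+correct\s+answer",                                       # explicit_statement
    r"([A-Z])\s*$",                                                                 # standalone_letter_end
    r"([A-Z])\s*\.",                                                                # letter_with_period
    r"([A-Z])\s*[^\w]",                                                             # letter_nonword
]

def extract_aa_v3(completion):
    """Return the letter captured by the first matching pattern, or None."""
    for pattern in AA_V3_PATTERNS:
        match = re.search(pattern, completion)
        if match:
            return match.group(1).upper()
    return None

print(extract_aa_v3("...long chain of thought...\nAnswer: H"))  # H
print(extract_aa_v3(r"Therefore the result is \boxed{J}."))     # J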

mmlu_pro_llama_4#

MMLU-Pro questions with custom regex extraction patterns for Llama 4

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro_llama_4

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_llama_4
target:
  api_endpoint: {}
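
Because this entry ships a non-null custom_config, the launch command above first writes it to temp_config.json and then converts it to custom_config.yml with an inline python3 call before passing --custom_eval_cfg_file. The sketch below reproduces that round-trip outside the container; the output directory "." is a placeholder for {{config.output_dir}}, and PyYAML is required:

import json
import yaml

output_dir = "."  # placeholder for {{config.output_dir}} inside the container

# The extraction block from the custom_config above.
custom_config = {
    "extraction": [
        {"regex": r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",
         "match_group": 1, "name": "answer_colon_llama4"},
        {"regex": r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])",
         "match_group": 1, "name": "answer_is_llama4"},
    ]
}

# echo '{{config.params.extra.custom_config | tojson}}' > temp_config.json
with open(f"{output_dir}/temp_config.json", "w") as f:
    json.dump(custom_config, f)

# The inline python3 -c step: JSON back in, YAML out for --custom_eval_cfg_file.
config_data = json.load(open(f"{output_dir}/temp_config.json"))
yaml.dump(config_data, open(f"{output_dir}/custom_config.yml", "w"),
          default_flow_style=False)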

mmlu_pt#

Global-MMLU 0-shot CoT in Portuguese (pt)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pt

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pt
target:
  api_endpoint: {}

mmlu_pt-lite#

Global-MMLU-Lite 0-shot CoT in Portuguese (pt)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pt-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pt-lite
target:
  api_endpoint: {}
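
For reference, a minimal sketch of how the command template above might render for this task, assuming a hypothetical OpenAI-compatible endpoint at http://localhost:8000/v1/chat/completions serving a model named my-model, an API key variable named MY_API_KEY, and an output directory of /results; with the defaults shown (judge, custom_config, limit_samples, and downsampling_ratio unset), the corresponding optional flags are omitted:

export API_KEY=$MY_API_KEY && simple_evals --model my-model --eval_name mmlu_pt-lite --url http://localhost:8000/v1/chat/completions --temperature 0.0 --top_p 1e-05 --max_tokens 16384 --out_dir /results/mmlu_pt-lite --cache_dir /results/mmlu_pt-lite/cache --num_threads 10 --max_retries 5 --timeout 60 --num_repeats 1

The endpoint URL, model name, key variable, and output directory are placeholders; every flag and value comes from the template and the default params listed above.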

mmlu_ro#

Global-MMLU 0-shot CoT in Romanian (ro)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ro

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ro
target:
  api_endpoint: {}

mmlu_ru#

Global-MMLU 0-shot CoT in Russian (ru)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ru

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ru
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ru
target:
  api_endpoint: {}
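
When extra.custom_config is set, the template first writes the configuration to disk as JSON and converts it to YAML before passing it via --custom_eval_cfg_file. A minimal standalone sketch of that conversion step, with a placeholder payload and output directory (neither the key nor the path is a real option of the harness):

echo '{"example_option": true}' > /results/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("/results/temp_config.json")); yaml.dump(config_data, open("/results/custom_config.yml", "w"), default_flow_style=False)'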

mmlu_si#

Global-MMLU 0-shot CoT in Sinhala (si)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_si

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_si
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_si
target:
  api_endpoint: {}

mmlu_sn#

Global-MMLU 0-shot CoT in Shona (sn)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sn

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sn
target:
  api_endpoint: {}

mmlu_so#

Global-MMLU 0-shot CoT in Somali (so)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_so

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_so
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_so
target:
  api_endpoint: {}

mmlu_sr#

Global-MMLU 0-shot CoT in Serbian (sr)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sr

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sr
target:
  api_endpoint: {}

mmlu_sv#

Global-MMLU 0-shot CoT in Swedish (sv)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sv

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sv
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sv
target:
  api_endpoint: {}

mmlu_sw#

Global-MMLU 0-shot CoT in Swahili (sw)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sw

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sw
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sw
target:
  api_endpoint: {}

mmlu_sw-lite#

Global-MMLU-Lite 0-shot CoT in Swahili (sw)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sw-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sw-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sw-lite
target:
  api_endpoint: {}

mmlu_te#

Global-MMLU 0-shot CoT in Telugu (te)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_te

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_te
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_te
target:
  api_endpoint: {}

mmlu_tr#

Global-MMLU 0-shot CoT in Turkish (tr)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_tr

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_tr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_tr
target:
  api_endpoint: {}

mmlu_uk#

Global-MMLU 0-shot CoT in Ukrainian (uk)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_uk

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_uk
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_uk
target:
  api_endpoint: {}

mmlu_vi#

Global-MMLU 0-shot CoT in Vietnamese (vi)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_vi

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_vi
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_vi
target:
  api_endpoint: {}

mmlu_yo#

Global-MMLU 0-shot CoT in Yoruba (yo)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_yo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_yo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_yo
target:
  api_endpoint: {}

mmlu_yo-lite#

Global-MMLU-Lite 0-shot CoT in Yoruba (yo)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_yo-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_yo-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_yo-lite
target:
  api_endpoint: {}
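
The command template above is rendered with Jinja2 from the config and target values listed for this task. As a minimal sketch of that rendering, the Python snippet below fills a trimmed excerpt of the template with the mmlu_yo-lite defaults; the model id, endpoint URL, and output directory are placeholder values, not part of this page.

# Minimal sketch: renders a trimmed excerpt of the command template above.
# The full template adds retries, caching, judge flags, and custom-config
# handling. Model id, endpoint URL, and output directory are placeholders.
from jinja2 import Template

template = Template(
    "simple_evals --model {{target.api_endpoint.model_id}}"
    " --eval_name {{config.params.task}}"
    " --url {{target.api_endpoint.url}}"
    " --temperature {{config.params.temperature}}"
    " --top_p {{config.params.top_p}}"
    " --max_tokens {{config.params.max_new_tokens}}"
    " --out_dir {{config.output_dir}}/{{config.type}}"
)

config = {
    "type": "mmlu_yo-lite",
    "output_dir": "/results",  # placeholder
    "params": {
        "task": "mmlu_yo-lite",
        "temperature": 0.0,
        "top_p": 1.0e-05,
        "max_new_tokens": 16384,
    },
}
target = {"api_endpoint": {"model_id": "my-model", "url": "http://localhost:8000/v1"}}

print(template.render(config=config, target=target))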

mmlu_zh-lite#

Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_zh-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_zh-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_zh-lite
target:
  api_endpoint: {}
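
When extra.custom_config is set, the command template first writes the config as JSON to temp_config.json and then converts it to YAML with an inline python3 call before passing it to simple_evals via --custom_eval_cfg_file. The snippet below is a standalone, expanded form of that one-liner; the output directory and the example config contents are placeholders.

# Expanded form of the inline JSON -> YAML conversion used by the command
# template when extra.custom_config is set. Paths and contents are placeholders.
import json
import yaml

output_dir = "/results"                # placeholder for config.output_dir
custom_config = {"example_option": 1}  # placeholder for extra.custom_config

# Step 1: the template echoes the custom config as JSON.
with open(f"{output_dir}/temp_config.json", "w") as f:
    json.dump(custom_config, f)

# Step 2: re-read the JSON and dump it as YAML for --custom_eval_cfg_file.
with open(f"{output_dir}/temp_config.json") as f:
    config_data = json.load(f)
with open(f"{output_dir}/custom_config.yml", "w") as f:
    yaml.dump(config_data, f, default_flow_style=False)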

simpleqa#

A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: simpleqa

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: simpleqa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: simpleqa
target:
  api_endpoint: {}
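
SimpleQA scores answers with an LLM judge; the judge block above configures that second endpoint and is left unset by default. The sketch below shows one way a judge call with those defaults (temperature 0.0, top_p 0.0001, max_tokens 1024) might look against an OpenAI-compatible endpoint. The judge URL, model id, example data, and grading prompt are illustrative placeholders, not the harness's actual grader.

# Illustrative judge call mirroring the judge defaults above. URL, model id,
# and grading prompt are placeholders, not the harness's actual grader.
from openai import OpenAI

judge = OpenAI(base_url="http://judge.example.com/v1", api_key="JUDGE_KEY")

question = "What year was the Eiffel Tower completed?"
gold = "1889"
prediction = "It was completed in 1889."

grading_prompt = (
    "Grade the predicted answer against the gold answer.\n"
    f"Question: {question}\nGold answer: {gold}\nPredicted answer: {prediction}\n"
    "Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."
)

reply = judge.chat.completions.create(
    model="judge-model",  # placeholder judge model id
    messages=[{"role": "user", "content": grading_prompt}],
    temperature=0.0,
    top_p=0.0001,
    max_tokens=1024,
)
print(reply.choices[0].message.content)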