simple_evals#

This page contains all evaluation tasks for the simple_evals harness.

Task | Description
--- | ---
AA_AIME_2024 | AIME 2024 questions, math, using Artificial Analysis’s setup.
AA_math_test_500 | OpenAI math test 500, using Artificial Analysis’s setup.
AIME_2024 | AIME 2024 questions, math
AIME_2025 | AIME 2025 questions, math
AIME_2025_aa_v2 | AIME 2025 questions, math - params aligned with Artificial Analysis Index v2
aime_2024_nemo | AIME 2024 questions, math, using NeMo’s alignment template
aime_2025_nemo | AIME 2025 questions, math, using NeMo’s alignment template
browsecomp | BrowseComp is a benchmark for measuring the ability of agents to browse the web.
gpqa_diamond | gpqa_diamond 0-shot CoT
gpqa_diamond_aa_v2 | gpqa_diamond questions with custom regex extraction patterns for AA v2
gpqa_diamond_aa_v2_llama_4 | gpqa_diamond questions with custom regex extraction patterns for Llama 4
gpqa_diamond_aa_v3 | GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing
gpqa_diamond_nemo | gpqa_diamond questions, reasoning, using NeMo’s alignment template
gpqa_extended | gpqa_extended 0-shot CoT
gpqa_main | gpqa_main 0-shot CoT
healthbench | HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare.
healthbench_consensus | HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians.
healthbench_hard | HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models.
humaneval | HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
humanevalplus | HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
math_test_500 | OpenAI math test 500
math_test_500_nemo | math_test_500 questions, math, using NeMo’s alignment template
mgsm | MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages.
mgsm_aa_v2 | MGSM is a benchmark of grade-school math problems - params aligned with Artificial Analysis Index v2
mmlu | MMLU 0-shot CoT
mmlu_am | Global-MMLU 0-shot CoT in Amharic (am)
mmlu_ar | Global-MMLU 0-shot CoT in Arabic (ar)
mmlu_ar-lite | Global-MMLU-Lite 0-shot CoT in Arabic (ar)
mmlu_bn | Global-MMLU 0-shot CoT in Bengali (bn)
mmlu_bn-lite | Global-MMLU-Lite 0-shot CoT in Bengali (bn)
mmlu_cs | Global-MMLU 0-shot CoT in Czech (cs)
mmlu_de | Global-MMLU 0-shot CoT in German (de)
mmlu_de-lite | Global-MMLU-Lite 0-shot CoT in German (de)
mmlu_el | Global-MMLU 0-shot CoT in Greek (el)
mmlu_en | Global-MMLU 0-shot CoT in English (en)
mmlu_en-lite | Global-MMLU-Lite 0-shot CoT in English (en)
mmlu_es | Global-MMLU 0-shot CoT in Spanish (es)
mmlu_es-lite | Global-MMLU-Lite 0-shot CoT in Spanish (es)
mmlu_fa | Global-MMLU 0-shot CoT in Persian (fa)
mmlu_fil | Global-MMLU 0-shot CoT in Filipino (fil)
mmlu_fr | Global-MMLU 0-shot CoT in French (fr)
mmlu_fr-lite | Global-MMLU-Lite 0-shot CoT in French (fr)
mmlu_ha | Global-MMLU 0-shot CoT in Hausa (ha)
mmlu_he | Global-MMLU 0-shot CoT in Hebrew (he)
mmlu_hi | Global-MMLU 0-shot CoT in Hindi (hi)
mmlu_hi-lite | Global-MMLU-Lite 0-shot CoT in Hindi (hi)
mmlu_id | Global-MMLU 0-shot CoT in Indonesian (id)
mmlu_id-lite | Global-MMLU-Lite 0-shot CoT in Indonesian (id)
mmlu_ig | Global-MMLU 0-shot CoT in Igbo (ig)
mmlu_it | Global-MMLU 0-shot CoT in Italian (it)
mmlu_it-lite | Global-MMLU-Lite 0-shot CoT in Italian (it)
mmlu_ja | Global-MMLU 0-shot CoT in Japanese (ja)
mmlu_ja-lite | Global-MMLU-Lite 0-shot CoT in Japanese (ja)
mmlu_ko | Global-MMLU 0-shot CoT in Korean (ko)
mmlu_ko-lite | Global-MMLU-Lite 0-shot CoT in Korean (ko)
mmlu_ky | Global-MMLU 0-shot CoT in Kyrgyz (ky)
mmlu_llama_4 | MMLU questions with custom regex extraction patterns for Llama 4
mmlu_lt | Global-MMLU 0-shot CoT in Lithuanian (lt)
mmlu_mg | Global-MMLU 0-shot CoT in Malagasy (mg)
mmlu_ms | Global-MMLU 0-shot CoT in Malay (ms)
mmlu_my-lite | Global-MMLU-Lite 0-shot CoT in Malay (my)
mmlu_ne | Global-MMLU 0-shot CoT in Nepali (ne)
mmlu_nl | Global-MMLU 0-shot CoT in Dutch (nl)
mmlu_ny | Global-MMLU 0-shot CoT in Nyanja (ny)
mmlu_pl | Global-MMLU 0-shot CoT in Polish (pl)
mmlu_pro | The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models’ capabilities. It contains 12K complex questions across various disciplines.
mmlu_pro_aa_v2 | MMLU-Pro - params aligned with Artificial Analysis Index v2
mmlu_pro_aa_v3 | MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options
mmlu_pro_llama_4 | MMLU-Pro questions with custom regex extraction patterns for Llama 4
mmlu_pt | Global-MMLU 0-shot CoT in Portuguese (pt)
mmlu_pt-lite | Global-MMLU-Lite 0-shot CoT in Portuguese (pt)
mmlu_ro | Global-MMLU 0-shot CoT in Romanian (ro)
mmlu_ru | Global-MMLU 0-shot CoT in Russian (ru)
mmlu_si | Global-MMLU 0-shot CoT in Sinhala (si)
mmlu_sn | Global-MMLU 0-shot CoT in Shona (sn)
mmlu_so | Global-MMLU 0-shot CoT in Somali (so)
mmlu_sr | Global-MMLU 0-shot CoT in Serbian (sr)
mmlu_sv | Global-MMLU 0-shot CoT in Swedish (sv)
mmlu_sw | Global-MMLU 0-shot CoT in Swahili (sw)
mmlu_sw-lite | Global-MMLU-Lite 0-shot CoT in Swahili (sw)
mmlu_te | Global-MMLU 0-shot CoT in Telugu (te)
mmlu_tr | Global-MMLU 0-shot CoT in Turkish (tr)
mmlu_uk | Global-MMLU 0-shot CoT in Ukrainian (uk)
mmlu_vi | Global-MMLU 0-shot CoT in Vietnamese (vi)
mmlu_yo | Global-MMLU 0-shot CoT in Yoruba (yo)
mmlu_yo-lite | Global-MMLU-Lite 0-shot CoT in Yoruba (yo)
mmlu_zh-lite | Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh)
simpleqa | A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.

AA_AIME_2024#

AIME 2024 questions, math, using Artificial Analysis’s setup.

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AA_AIME_2024

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AA_AIME_2024
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AA_AIME_2024
target:
  api_endpoint: {}
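
For reference, the command template above expands the task’s config params into a plain simple_evals invocation. The Python sketch below is only an illustration of that mapping, using the AA_AIME_2024 defaults shown above; the model ID, endpoint URL, and output directory are hypothetical placeholders, not part of the task definition.

# Illustrative only: mirrors how the command template maps config params
# to simple_evals CLI flags. The model ID, URL, and output directory are
# hypothetical placeholders.
params = {
    "task": "AA_AIME_2024",
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "max_new_tokens": 16384,
    "parallelism": 10,
    "max_retries": 5,
    "request_timeout": 60,
    "n_samples": 10,
}

cmd = [
    "simple_evals",
    "--model", "my-model",                                  # target.api_endpoint.model_id (placeholder)
    "--eval_name", params["task"],
    "--url", "http://localhost:8000/v1/chat/completions",   # target.api_endpoint.url (placeholder)
    "--temperature", str(params["temperature"]),
    "--top_p", str(params["top_p"]),
    "--max_tokens", str(params["max_new_tokens"]),
    "--out_dir", "results/AA_AIME_2024",                    # {{config.output_dir}}/{{config.type}} (placeholder)
    "--cache_dir", "results/AA_AIME_2024/cache",
    "--num_threads", str(params["parallelism"]),
    "--max_retries", str(params["max_retries"]),
    "--timeout", str(params["request_timeout"]),
    "--num_repeats", str(params["n_samples"]),              # from extra.n_samples
]
print(" ".join(cmd))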

AA_math_test_500#

OpenAI math test 500, using Artificial Analysis’s setup.

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AA_math_test_500

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AA_math_test_500
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AA_math_test_500
target:
  api_endpoint: {}
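
When extra.custom_config is set, the command template first writes it out as temp_config.json and then converts it to custom_config.yml with an inline python3 -c call before passing it via --custom_eval_cfg_file. Expanded for readability, that conversion step is essentially the following (output_dir stands in for {{config.output_dir}}):

import json
import yaml

output_dir = "results"  # placeholder for {{config.output_dir}}

# Same logic as the inline python3 -c snippet in the command template:
# read the JSON dump of extra.custom_config and rewrite it as YAML.
with open(f"{output_dir}/temp_config.json") as f:
    config_data = json.load(f)

with open(f"{output_dir}/custom_config.yml", "w") as f:
    yaml.dump(config_data, f, default_flow_style=False)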

AIME_2024#

AIME 2024 questions, math

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AIME_2024

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AIME_2024
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2024
target:
  api_endpoint: {}

AIME_2025#

AIME 2025 questions, math

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AIME_2025

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AIME_2025
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2025
target:
  api_endpoint: {}

AIME_2025_aa_v2#

AIME 2025 questions, math - params aligned with Artificial Analysis Index v2

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: AIME_2025_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: AIME_2025
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2025_aa_v2
target:
  api_endpoint: {}

aime_2024_nemo#

AIME 2024 questions, math, using NeMo’s alignment template

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: aime_2024_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: aime_2024_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: aime_2024_nemo
target:
  api_endpoint: {}

aime_2025_nemo#

AIME 2025 questions, math, using NeMo’s alignment template

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: aime_2025_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: aime_2025_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: aime_2025_nemo
target:
  api_endpoint: {}

browsecomp#

BrowseComp is a benchmark for measuring the ability of agents to browse the web.

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: browsecomp

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: browsecomp
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: browsecomp
target:
  api_endpoint: {}
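
browsecomp is scored with an LLM judge; its defaults above set judge.api_key to JUDGE_API_KEY and judge.backend to openai while leaving judge.url and judge.model_id null. In the command template, each judge flag is wrapped in an is-not-none check, so only non-null values are rendered. A minimal sketch of that conditional rendering, assuming the defaults shown above:

# Mirrors the {% if ... is not none %} guards around the judge flags in the
# command template. Values are the browsecomp defaults; url and model_id stay
# None, so no --judge_url / --judge_model_id flags are emitted.
judge = {
    "url": None,
    "model_id": None,
    "api_key": "JUDGE_API_KEY",
    "backend": "openai",
    "request_timeout": 600,
    "max_retries": 16,
    "temperature": 0.0,
    "top_p": 0.0001,
    "max_tokens": 1024,
    "max_concurrent_requests": None,
}

flags = []
for key, value in judge.items():
    if value is not None:
        # api_key maps to --judge_api_key_name; the rest map to --judge_<key>.
        flag = "--judge_api_key_name" if key == "api_key" else f"--judge_{key}"
        flags.extend([flag, str(value)])
print(" ".join(flags))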

gpqa_diamond#

gpqa_diamond 0-shot CoT

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond
target:
  api_endpoint: {}

gpqa_diamond_aa_v2#

gpqa_diamond questions with custom regex extraction patterns for AA v2

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: aa_v2_regex
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v2
target:
  api_endpoint: {}
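
The aa_v2_regex entry above extracts the chosen letter from a response that ends with an "Answer: X" line (allowing markdown bold or underscore markers around it). A minimal sketch of applying that pattern; the sample response text is invented for illustration:

import re

# The aa_v2_regex pattern and match_group from the custom_config above.
PATTERN = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"

response = "The reaction proceeds with inversion of configuration.\n\n**Answer: C**"  # invented example
match = re.search(PATTERN, response)
if match:
    print(match.group(1))  # match_group 1 -> "C"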

gpqa_diamond_aa_v2_llama_4#

gpqa_diamond questions with custom regex extraction patterns for Llama 4

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_aa_v2_llama_4

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v2_llama_4
target:
  api_endpoint: {}
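
The Llama 4 variant lists two extraction patterns: the same "Answer: X" format, plus a fallback for responses phrased as "the best answer is X". A reasonable reading is that the patterns are tried in the order listed and the first match wins; the sketch below makes that assumption explicit, with invented sample responses:

import re

# Patterns from the custom_config above, in the order listed.
PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # answer_colon_llama4
    r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])",           # answer_is_llama4

]

def extract(response):
    # Assumption for illustration: the first matching pattern wins.
    for pattern in PATTERNS:
        m = re.search(pattern, response)
        if m:
            return m.group(1).upper()
    return None

print(extract("**Answer: B**"))           # -> B (first pattern)
print(extract("The best answer is D."))   # -> D (fallback pattern)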

gpqa_diamond_aa_v3#

GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_aa_v3

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following
          format: ''Answer: A/B/C/D'' (e.g. ''Answer: A'').


          {Question}


          A) {A}

          B) {B}

          C) {C}

          D) {D}

          '
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: primary_answer_format
        - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
          match_group: 1
          name: latex_boxed
        - regex: answer is ([a-zA-Z])
          match_group: 1
          name: natural_language
        - regex: answer is \(([a-zA-Z])\)
          match_group: 1
          name: with_parenthesis
        - regex: ([A-Z])\)\s*[^A-Z]*
          match_group: 1
          name: choice_format
        - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
          match_group: 1
          name: explicit_statement
        - regex: ([A-Z])\s*$
          match_group: 1
          name: standalone_letter_end
        - regex: ([A-Z])\s*\.
          match_group: 1
          name: letter_with_period
        - regex: ([A-Z])\s*[^\w]
          match_group: 1
          name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v3
target:
  api_endpoint: {}
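
The aa_v3 custom_config pairs an explicit prompt template with a cascade of nine extraction patterns, from the strict "Answer: X" line down to progressively looser fallbacks such as a standalone trailing letter. The sketch below shows how the template might be filled and the cascade applied; the question content and the first-match-wins ordering are assumptions for illustration, and the template’s whitespace is condensed:

import re

# Prompt template from the custom_config above (whitespace condensed).
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be in the following format: 'Answer: A/B/C/D' "
    "(e.g. 'Answer: A').\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}\n"
)

# Extraction patterns from the custom_config above, strictest first.
PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # primary_answer_format
    r"\\boxed\{[^}]*([A-Z])[^}]*\}",                                                # latex_boxed
    r"answer is ([a-zA-Z])",                                                        # natural_language
    r"answer is \(([a-zA-Z])\)",                                                    # with_parenthesis
    r"([A-Z])\)\s*[^A-Z]*",                                                         # choice_format
    r"([A-Z])\s+is\s+the\s+correct\s+answer",                                       # explicit_statement
    r"([A-Z])\s*$",                                                                 # standalone_letter_end
    r"([A-Z])\s*\.",                                                                # letter_with_period
    r"([A-Z])\s*[^\w]",                                                             # letter_nonword
]

def extract_answer(response):
    # Assumption for illustration: patterns are tried in order, first match wins.
    for pattern in PATTERNS:
        m = re.search(pattern, response)
        if m:
            return m.group(1).upper()
    return None

prompt = PROMPT_TEMPLATE.format(
    Question="Which particle mediates the electromagnetic force?",  # invented example
    A="Gluon", B="Photon", C="W boson", D="Graviton",
)
print(extract_answer("The force carrier is the photon.\n\nAnswer: B"))  # -> B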

gpqa_diamond_nemo#

gpqa_diamond questions, reasoning, using NeMo’s alignment template

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_diamond_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_nemo
target:
  api_endpoint: {}
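
For orientation, the command template above might render roughly as follows for this task, assuming a hypothetical OpenAI-compatible endpoint at http://localhost:8000/v1/chat/completions, a placeholder model ID of my-model, and /results as the output directory. The remaining values are taken directly from the default params shown in the config; the judge flags render because the judge defaults (backend, timeouts, sampling) are non-null even though no judge endpoint is set.

# Illustrative rendering only; the endpoint URL, model ID, and output directory are placeholders.
simple_evals --model my-model \
  --eval_name gpqa_diamond_nemo \
  --url http://localhost:8000/v1/chat/completions \
  --temperature 0.0 --top_p 1e-05 --max_tokens 16384 \
  --out_dir /results/gpqa_diamond_nemo \
  --cache_dir /results/gpqa_diamond_nemo/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --num_repeats 5 \
  --judge_backend openai --judge_request_timeout 600 --judge_max_retries 16 \
  --judge_temperature 0.0 --judge_top_p 0.0001 --judge_max_tokens 1024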

gpqa_extended#

gpqa_extended 0-shot CoT

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_extended

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_extended
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_extended
target:
  api_endpoint: {}

gpqa_main#

gpqa_main 0-shot CoT

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: gpqa_main

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_main
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_main
target:
  api_endpoint: {}
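
When config.params.extra.custom_config is set (it is null by default in these entries), the command template serializes it to JSON, converts it to YAML with an inline python3 -c call, and passes the resulting file to simple_evals via --custom_eval_cfg_file. A minimal sketch of those rendered steps, assuming a purely hypothetical custom_config value and /results as the output directory:

# Hypothetical illustration; the custom_config contents and output directory are placeholders.
echo '{"example_key": "example_value"}' > /results/temp_config.json
python3 -c 'import yaml, json; config_data = json.load(open("/results/temp_config.json")); yaml.dump(config_data, open("/results/custom_config.yml", "w"), default_flow_style=False)'
# The generated file is then appended to the simple_evals invocation as:
#   --custom_eval_cfg_file /results/custom_config.yml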

healthbench#

HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: healthbench

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench
target:
  api_endpoint: {}

healthbench_consensus#

HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: healthbench_consensus

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench_consensus
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench_consensus
target:
  api_endpoint: {}

healthbench_hard#

HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: healthbench_hard

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench_hard
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench_hard
target:
  api_endpoint: {}
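
The three HealthBench tasks are judge-based: their default configs set extra.judge.api_key to JUDGE_API_KEY while leaving the judge URL and model ID null for the user to supply. Assuming those two fields have been overridden, the judge portion of the rendered command might look like the sketch below; the judge endpoint and model ID are placeholders, and JUDGE_API_KEY must be exported in the environment.

# Placeholder judge endpoint and model ID.
export JUDGE_API_KEY=<your-judge-api-key>
# Judge flags appended to the simple_evals command at render time:
#   --judge_url https://judge.example.com/v1/chat/completions \
#   --judge_model_id example-judge-model \
#   --judge_api_key_name JUDGE_API_KEY \
#   --judge_backend openai --judge_request_timeout 600 --judge_max_retries 16 \
#   --judge_temperature 0.0 --judge_top_p 0.0001 --judge_max_tokens 1024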

humaneval#

HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: humaneval

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: humaneval
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: humaneval
target:
  api_endpoint: {}

humanevalplus#

HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: humanevalplus

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: humanevalplus
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: humanevalplus
target:
  api_endpoint: {}

math_test_500#

OpenAI math test 500

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: math_test_500

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: math_test_500
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: math_test_500
target:
  api_endpoint: {}

math_test_500_nemo#

math_test_500 questions, math, using NeMo’s alignment template

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: math_test_500_nemo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: math_test_500_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: math_test_500_nemo
target:
  api_endpoint: {}

mgsm#

MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mgsm

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mgsm
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mgsm
target:
  api_endpoint: {}

mgsm_aa_v2#

MGSM is a benchmark of grade-school math problems - params aligned with Artificial Analysis Index v2

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mgsm_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mgsm
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mgsm_aa_v2
target:
  api_endpoint: {}

mmlu#

MMLU 0-shot CoT

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu
target:
  api_endpoint: {}

mmlu_am#

Global-MMLU 0-shot CoT in Amharic (am)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_am

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_am
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_am
target:
  api_endpoint: {}

mmlu_ar#

Global-MMLU 0-shot CoT in Arabic (ar)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ar

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ar
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ar
target:
  api_endpoint: {}
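
Rendered example (illustrative): with the mmlu_ar defaults above, the command template expands to a plain simple_evals invocation similar to the sketch below. The model ID, endpoint URL, output directory, and API key variable are placeholder assumptions, not values defined on this page.

export API_KEY=$MY_API_KEY
simple_evals --model my-org/my-chat-model \
  --eval_name mmlu_ar \
  --url https://example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
  --out_dir ./results/mmlu_ar --cache_dir ./results/mmlu_ar/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --num_repeats 1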

mmlu_ar-lite#

Global-MMLU-Lite 0-shot CoT in Arabic (ar)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ar-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ar-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ar-lite
target:
  api_endpoint: {}

mmlu_bn#

Global-MMLU 0-shot CoT in Bengali (bn)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_bn

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_bn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_bn
target:
  api_endpoint: {}

mmlu_bn-lite#

Global-MMLU-Lite 0-shot CoT in Bengali (bn)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_bn-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_bn-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_bn-lite
target:
  api_endpoint: {}

mmlu_cs#

Global-MMLU 0-shot CoT in Czech (cs)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_cs

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_cs
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_cs
target:
  api_endpoint: {}

mmlu_de#

Global-MMLU 0-shot CoT in German (de)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_de

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_de
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_de
target:
  api_endpoint: {}
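
Smoke-test example (illustrative): the template also exposes --first_n (from limit_samples) and --downsampling_ratio (from extra.downsampling_ratio) for running only part of a full Global-MMLU split. The endpoint, model ID, and the value of --first_n below are arbitrary assumptions for a quick check, not recommended settings.

simple_evals --model my-org/my-chat-model \
  --eval_name mmlu_de \
  --url https://example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
  --out_dir ./results/mmlu_de --cache_dir ./results/mmlu_de/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --first_n 100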

mmlu_de-lite#

Global-MMLU-Lite 0-shot CoT in German (de)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_de-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_de-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_de-lite
target:
  api_endpoint: {}

mmlu_el#

Global-MMLU 0-shot CoT in Greek (el)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_el

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_el
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_el
target:
  api_endpoint: {}

mmlu_en#

Global-MMLU 0-shot CoT in English (en)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_en

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_en
target:
  api_endpoint: {}
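
Container example (illustrative): one way to run this task is to execute the rendered command inside the listed container image. The mount path, model ID, and endpoint below are assumptions, and the exact entrypoint may differ depending on how your launcher invokes the container.

docker run --rm \
  -e API_KEY=$MY_API_KEY \
  -v "$PWD/results:/results" \
  nvcr.io/nvidia/eval-factory/simple-evals:26.01 \
  simple_evals --model my-org/my-chat-model \
    --eval_name mmlu_en \
    --url https://example.com/v1/chat/completions \
    --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
    --out_dir /results/mmlu_en --cache_dir /results/mmlu_en/cache \
    --num_threads 10 --max_retries 5 --timeout 60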

mmlu_en-lite#

Global-MMLU-Lite 0-shot CoT in English (en)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_en-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_en-lite
target:
  api_endpoint: {}

mmlu_es#

Global-MMLU 0-shot CoT in Spanish (es)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_es

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_es
target:
  api_endpoint: {}

mmlu_es-lite#

Global-MMLU-Lite 0-shot CoT in Spanish (es)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_es-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_es-lite
target:
  api_endpoint: {}
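
Custom config example (illustrative): when extra.custom_config is set, the template first writes it out as JSON, converts it to YAML with the embedded python3 one-liner, and then passes the file via --custom_eval_cfg_file. The sketch below shows that rendered preamble; the config key and value are placeholders, since the custom-config schema is not documented on this page, and the model ID and endpoint are likewise assumptions.

echo '{"example_option": "example_value"}' > ./results/temp_config.json
python3 -c 'import yaml, json; config_data = json.load(open("./results/temp_config.json")); yaml.dump(config_data, open("./results/custom_config.yml", "w"), default_flow_style=False)'
simple_evals --model my-org/my-chat-model --eval_name mmlu_es-lite \
  --url https://example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1.0e-05 --max_tokens 16384 \
  --out_dir ./results/mmlu_es-lite --cache_dir ./results/mmlu_es-lite/cache \
  --num_threads 10 --max_retries 5 --timeout 60 \
  --custom_eval_cfg_file ./results/custom_config.yml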

mmlu_fa#

Global-MMLU 0-shot CoT in Persian (fa)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fa

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fa
target:
  api_endpoint: {}

mmlu_fil#

Global-MMLU 0-shot CoT in Filipino (fil)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fil

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fil
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fil
target:
  api_endpoint: {}

mmlu_fr#

Global-MMLU 0-shot CoT in French (fr)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fr

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fr
target:
  api_endpoint: {}

mmlu_fr-lite#

Global-MMLU-Lite 0-shot CoT in French (fr)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_fr-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fr-lite
target:
  api_endpoint: {}
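
The launch command shown for each task is a Jinja2 template that the harness fills in from the target endpoint and the config block. As a minimal sketch of how the rendering could come out for mmlu_fr-lite with the defaults listed above, the Python snippet below assembles the core flags; the model ID, endpoint URL, and output directory are hypothetical placeholders, not values defined on this page.

# Sketch only: build the core simple_evals invocation from the mmlu_fr-lite defaults.
# MODEL_ID, URL, and OUT_DIR are assumptions standing in for the template placeholders.
params = {
    "task": "mmlu_fr-lite",
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "max_new_tokens": 16384,
    "parallelism": 10,
    "max_retries": 5,
    "request_timeout": 60,
    "n_samples": 1,
}
MODEL_ID = "my-org/my-model"           # hypothetical target.api_endpoint.model_id
URL = "http://localhost:8000/v1"       # hypothetical target.api_endpoint.url
OUT_DIR = "/results/mmlu_fr-lite"      # hypothetical config.output_dir/config.type

command = (
    f"simple_evals --model {MODEL_ID} --eval_name {params['task']} --url {URL} "
    f"--temperature {params['temperature']} --top_p {params['top_p']} "
    f"--max_tokens {params['max_new_tokens']} --out_dir {OUT_DIR} "
    f"--cache_dir {OUT_DIR}/cache --num_threads {params['parallelism']} "
    f"--max_retries {params['max_retries']} --timeout {params['request_timeout']} "
    f"--num_repeats {params['n_samples']}"
)
print(command)

Optional flags such as --first_n, --add_system_prompt, --downsampling_ratio, and the --judge_* group are guarded by {% if %} blocks in the template and are appended only when the corresponding fields are set, which is why they are absent with the null/false defaults above.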

mmlu_ha#

Global-MMLU 0-shot CoT in Hausa (ha)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ha

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ha
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ha
target:
  api_endpoint: {}
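
When extra.custom_config is not null, the template first writes it to temp_config.json and converts it to custom_config.yml through the inline python3 -c command before launching simple_evals. The following is a readable sketch of that same conversion step; OUTPUT_DIR is an assumption standing in for the {{config.output_dir}} placeholder.

# Sketch of the JSON-to-YAML conversion embedded in the launch template above.
# Requires PyYAML; OUTPUT_DIR is a hypothetical stand-in for {{config.output_dir}}.
import json
import yaml

OUTPUT_DIR = "/results/mmlu_ha"  # assumption

with open(f"{OUTPUT_DIR}/temp_config.json") as f:
    config_data = json.load(f)

with open(f"{OUTPUT_DIR}/custom_config.yml", "w") as f:
    yaml.dump(config_data, f, default_flow_style=False)

The resulting custom_config.yml is then passed to simple_evals via --custom_eval_cfg_file, as the final conditional in the template shows.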

mmlu_he#

Global-MMLU 0-shot CoT in Hebrew (he)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_he

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_he
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_he
target:
  api_endpoint: {}

mmlu_hi#

Global-MMLU 0-shot CoT in Hindi (hi)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_hi

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_hi
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_hi
target:
  api_endpoint: {}
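
All of these tasks ship with the judge block's url, model_id, and api_key set to null, so none of the --judge_* flags are emitted by default. Each non-null judge field maps to one flag in the template; the sketch below mirrors those guards with a hypothetical judge endpoint filled in for illustration.

# Sketch: how non-null judge.* fields map to the optional --judge_* flags.
# The endpoint values are assumptions; the flag names come from the template above.
judge = {
    "url": "http://judge-host:8000/v1",   # hypothetical judge endpoint
    "model_id": "judge-model",            # hypothetical judge model
    "api_key": None,
    "backend": "openai",
    "request_timeout": 600,
    "max_retries": 16,
    "temperature": 0.0,
    "top_p": 0.0001,
    "max_tokens": 1024,
    "max_concurrent_requests": None,
}
flags = []
for key, value in judge.items():
    if value is not None:  # mirrors the `is not none` guards in the template
        flag = "--judge_api_key_name" if key == "api_key" else f"--judge_{key}"
        flags.append(f"{flag} {value}")
print(" ".join(flags))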

mmlu_hi-lite#

Global-MMLU-Lite 0-shot CoT in Hindi (hi)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_hi-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_hi-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_hi-lite
target:
  api_endpoint: {}

mmlu_id#

Global-MMLU 0-shot CoT in Indonesian (id)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_id

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_id
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_id
target:
  api_endpoint: {}

mmlu_id-lite#

Global-MMLU-Lite 0-shot CoT in Indonesian (id)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_id-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_id-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_id-lite
target:
  api_endpoint: {}

mmlu_ig#

Global-MMLU 0-shot CoT in Igbo (ig)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ig

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ig
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ig
target:
  api_endpoint: {}
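
Every task in this family lists chat as its only supported endpoint type, and the default judge backend is openai. As a hedged illustration only (the request shape is not defined on this page), a chat-type endpoint is generally expected to accept an OpenAI-style chat completions payload along these lines:

# Assumed shape of a request to a chat-type endpoint; URL and model name are placeholders.
import json
import urllib.request

payload = {
    "model": "my-org/my-model",  # hypothetical model ID
    "messages": [{"role": "user", "content": "Answer with a single letter: A or B?"}],
    "temperature": 0.0,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # hypothetical endpoint URL
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # left commented out; no live endpoint is assumed here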

mmlu_it#

Global-MMLU 0-shot CoT in Italian (it)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_it

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_it
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_it
target:
  api_endpoint: {}

mmlu_it-lite#

Global-MMLU-Lite 0-shot CoT in Italian (it)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_it-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_it-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_it-lite
target:
  api_endpoint: {}

mmlu_ja#

Global-MMLU 0-shot CoT in Japanese (ja)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ja

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ja
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ja
target:
  api_endpoint: {}

mmlu_ja-lite#

Global-MMLU-Lite 0-shot CoT in Japanese (ja)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ja-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ja-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ja-lite
target:
  api_endpoint: {}

mmlu_ko#

Global-MMLU 0-shot CoT in Korean (ko)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ko

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ko
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ko
target:
  api_endpoint: {}

mmlu_ko-lite#

Global-MMLU-Lite 0-shot CoT in Korean (ko)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ko-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ko-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ko-lite
target:
  api_endpoint: {}

mmlu_ky#

Global-MMLU 0-shot CoT in Kyrgyz (ky)

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ky

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ky
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ky
target:
  api_endpoint: {}

mmlu_llama_4#

MMLU questions with custom regex extraction patterns for Llama 4

Harness: simple_evals

Container: nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest: sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_llama_4

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_llama_4
target:
  api_endpoint: {}
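
The two extraction patterns in the custom_config above are tried in order against the model's completion, and the first match supplies the graded letter (match_group: 1). A minimal Python sketch of that behavior, with the regexes copied verbatim from the config; the harness's actual extraction code may differ in detail:

import re

# Llama-4 extraction patterns, copied verbatim from the custom_config above,
# tried in the order they are listed.
LLAMA4_PATTERNS = [
    ("answer_colon_llama4",
     r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"),
    ("answer_is_llama4",
     r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])"),
]

def extract_choice(completion):
    """Return the first captured answer letter (upper-cased), or None."""
    for _name, pattern in LLAMA4_PATTERNS:
        match = re.search(pattern, completion)
        if match:
            return match.group(1).upper()  # match_group: 1 in the config
    return None

print(extract_choice("Step-by-step reasoning... **Answer:** C"))  # C
print(extract_choice("So the best answer is B, final."))          # B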

mmlu_lt#

Global-MMLU 0-shot CoT in Lithuanian (lt)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_lt

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_lt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_lt
target:
  api_endpoint: {}
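
Every task on this page launches through the same command template; each --judge_* flag is appended only when the corresponding field under extra.judge is not null, so with the defaults above (url and model_id null, backend and the sampling limits set) the judge URL and model flags are simply omitted. A minimal jinja2 sketch of that conditional rendering, using a reduced subset of the template and purely illustrative judge values:

from jinja2 import Template

# Reduced subset of the launch template: a flag is emitted only when the
# corresponding judge field is not None.
snippet = (
    "{% if judge.url is not none %} --judge_url {{judge.url}}{% endif %}"
    "{% if judge.model_id is not none %} --judge_model_id {{judge.model_id}}{% endif %}"
    "{% if judge.backend is not none %} --judge_backend {{judge.backend}}{% endif %}"
    "{% if judge.max_tokens is not none %} --judge_max_tokens {{judge.max_tokens}}{% endif %}"
)

defaults = {"url": None, "model_id": None, "backend": "openai", "max_tokens": 1024}
# Placeholder judge endpoint and model id, not taken from this page.
overridden = {**defaults, "url": "https://judge.example/v1", "model_id": "judge-model"}

print(Template(snippet).render(judge=defaults))
#  --judge_backend openai --judge_max_tokens 1024
print(Template(snippet).render(judge=overridden))
#  --judge_url https://judge.example/v1 --judge_model_id judge-model --judge_backend openai --judge_max_tokens 1024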

mmlu_mg#

Global-MMLU 0-shot CoT in Malagasy (mg)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_mg

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_mg
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_mg
target:
  api_endpoint: {}

mmlu_ms#

Global-MMLU 0-shot CoT in Malay (ms)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ms

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ms
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ms
target:
  api_endpoint: {}

mmlu_my-lite#

Global-MMLU-Lite 0-shot CoT in Burmese (my)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_my-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_my-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_my-lite
target:
  api_endpoint: {}

mmlu_ne#

Global-MMLU 0-shot CoT in Nepali (ne)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ne

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ne
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ne
target:
  api_endpoint: {}

mmlu_nl#

Global-MMLU 0-shot CoT in Dutch (nl)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_nl

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_nl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_nl
target:
  api_endpoint: {}

mmlu_ny#

Global-MMLU 0-shot CoT in Nyanja (ny)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ny

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ny
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ny
target:
  api_endpoint: {}

mmlu_pl#

Global-MMLU 0-shot CoT in Polish (pl)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pl

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pl
target:
  api_endpoint: {}

mmlu_pro#

MMLU-Pro is a more robust and challenging massive multi-task understanding dataset, designed to benchmark large language models more rigorously than the original MMLU. It contains 12K complex questions across various disciplines.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro
target:
  api_endpoint: {}

mmlu_pro_aa_v2#

MMLU-Pro - params aligned with Artificial Analysis Index v2

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro_aa_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_aa_v2
target:
  api_endpoint: {}

mmlu_pro_aa_v3#

MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro_aa_v3

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following
          format: ''Answer: A/B/C/D/E/F/G/H/I/J'' (e.g. ''Answer: A'').


          {Question}


          A) {A}

          B) {B}

          C) {C}

          D) {D}

          E) {E}

          F) {F}

          G) {G}

          H) {H}

          I) {I}

          J) {J}

          '
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: primary_answer_format
        - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
          match_group: 1
          name: latex_boxed
        - regex: answer is ([a-zA-Z])
          match_group: 1
          name: natural_language
        - regex: answer is \(([a-zA-Z])\)
          match_group: 1
          name: with_parenthesis
        - regex: ([A-Z])\)\s*[^A-Z]*
          match_group: 1
          name: choice_format
        - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
          match_group: 1
          name: explicit_statement
        - regex: ([A-Z])\s*$
          match_group: 1
          name: standalone_letter_end
        - regex: ([A-Z])\s*\.
          match_group: 1
          name: letter_with_period
        - regex: ([A-Z])\s*[^\w]
          match_group: 1
          name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_aa_v3
target:
  api_endpoint: {}
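
The AA v3 custom_config pairs a fixed prompt_template (the question plus ten options A-J and an explicit "Answer: X" instruction) with a cascade of extraction regexes that runs from the most specific pattern (the "Answer:" line) down to increasingly permissive fallbacks (a \boxed{} letter, "answer is ...", and finally bare letters). A minimal Python sketch of that multi-stage extraction, with the patterns copied from the config above; the harness's own logic may differ:

import re

# Extraction cascade copied from the custom_config above, most specific first.
AA_V3_PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # primary_answer_format
    r"\\boxed\{[^}]*([A-Z])[^}]*\}",                                                # latex_boxed
    r"answer is ([a-zA-Z])",                                                        # natural_language
    r"answer is \(([a-zA-Z])\)",                                                    # with_parenthesis
    r"([A-Z])\)\s*[^A-Z]*",                                                         # choice_format
    r"([A-Z])\s+is\s+the\s+correct\s+answer",                                       # explicit_statement
    r"([A-Z])\s*$",                                                                 # standalone_letter_end
    r"([A-Z])\s*\.",                                                                # letter_with_period
    r"([A-Z])\s*[^\w]",                                                             # letter_nonword
]

def extract_aa_v3(completion):
    """Return the letter captured by the first matching pattern, or None."""
    for pattern in AA_V3_PATTERNS:
        match = re.search(pattern, completion)
        if match:
            return match.group(1).upper()
    return None

print(extract_aa_v3("...long chain of thought...\nAnswer: H"))  # H
print(extract_aa_v3(r"Therefore the result is \boxed{J}."))     # J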

mmlu_pro_llama_4#

MMLU-Pro questions with custom regex extraction patterns for Llama 4

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pro_llama_4

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_llama_4
target:
  api_endpoint: {}
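
Because this entry ships a non-null custom_config, the launch command above first writes it to temp_config.json and then converts it to custom_config.yml with an inline python3 call before passing --custom_eval_cfg_file. The sketch below reproduces that round-trip outside the container; the output directory "." is a placeholder for {{config.output_dir}}, and PyYAML is required:

import json
import yaml

output_dir = "."  # placeholder for {{config.output_dir}} inside the container

# The extraction block from the custom_config above.
custom_config = {
    "extraction": [
        {"regex": r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",
         "match_group": 1, "name": "answer_colon_llama4"},
        {"regex": r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])",
         "match_group": 1, "name": "answer_is_llama4"},
    ]
}

# echo '{{config.params.extra.custom_config | tojson}}' > temp_config.json
with open(f"{output_dir}/temp_config.json", "w") as f:
    json.dump(custom_config, f)

# The inline python3 -c step: JSON back in, YAML out for --custom_eval_cfg_file.
config_data = json.load(open(f"{output_dir}/temp_config.json"))
yaml.dump(config_data, open(f"{output_dir}/custom_config.yml", "w"),
          default_flow_style=False)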

mmlu_pt#

Global-MMLU 0-shot CoT in Portuguese (pt)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pt

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pt
target:
  api_endpoint: {}

mmlu_pt-lite#

Global-MMLU-Lite 0-shot CoT in Portuguese (pt)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_pt-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pt-lite
target:
  api_endpoint: {}
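
For reference, a minimal sketch of how the command template above might render for this task, assuming a hypothetical OpenAI-compatible endpoint at http://localhost:8000/v1/chat/completions serving a model named my-model, an API key variable named MY_API_KEY, and an output directory of /results; with the defaults shown (judge, custom_config, limit_samples, and downsampling_ratio unset), the corresponding optional flags are omitted:

export API_KEY=$MY_API_KEY && simple_evals --model my-model --eval_name mmlu_pt-lite --url http://localhost:8000/v1/chat/completions --temperature 0.0 --top_p 1e-05 --max_tokens 16384 --out_dir /results/mmlu_pt-lite --cache_dir /results/mmlu_pt-lite/cache --num_threads 10 --max_retries 5 --timeout 60 --num_repeats 1

The endpoint URL, model name, key variable, and output directory are placeholders; every flag and value comes from the template and the default params listed above.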

mmlu_ro#

Global-MMLU 0-shot CoT in Romanian (ro)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ro

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ro
target:
  api_endpoint: {}

mmlu_ru#

Global-MMLU 0-shot CoT in Russian (ru)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_ru

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ru
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ru
target:
  api_endpoint: {}
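
When extra.custom_config is set, the template first writes the configuration to disk as JSON and converts it to YAML before passing it via --custom_eval_cfg_file. A minimal standalone sketch of that conversion step, with a placeholder payload and output directory (neither the key nor the path is a real option of the harness):

echo '{"example_option": true}' > /results/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("/results/temp_config.json")); yaml.dump(config_data, open("/results/custom_config.yml", "w"), default_flow_style=False)'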

mmlu_si#

Global-MMLU 0-shot CoT in Sinhala (si)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_si

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_si
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_si
target:
  api_endpoint: {}

mmlu_sn#

Global-MMLU 0-shot CoT in Shona (sn)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sn

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sn
target:
  api_endpoint: {}

mmlu_so#

Global-MMLU 0-shot CoT in Somali (so)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_so

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_so
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_so
target:
  api_endpoint: {}

mmlu_sr#

Global-MMLU 0-shot CoT in Serbian (sr)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sr

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sr
target:
  api_endpoint: {}

mmlu_sv#

Global-MMLU 0-shot CoT in Swedish (sv)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sv

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sv
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sv
target:
  api_endpoint: {}

mmlu_sw#

Global-MMLU 0-shot CoT in Swahili (sw)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sw

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sw
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sw
target:
  api_endpoint: {}

mmlu_sw-lite#

Global-MMLU-Lite 0-shot CoT in Swahili (sw)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_sw-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sw-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sw-lite
target:
  api_endpoint: {}

mmlu_te#

Global-MMLU 0-shot CoT in Telugu (te)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_te

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_te
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_te
target:
  api_endpoint: {}

mmlu_tr#

Global-MMLU 0-shot CoT in Turkish (tr)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_tr

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_tr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_tr
target:
  api_endpoint: {}

mmlu_uk#

Global-MMLU 0-shot CoT in Ukrainian (uk)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_uk

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_uk
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_uk
target:
  api_endpoint: {}

mmlu_vi#

Global-MMLU 0-shot CoT in Vietnamese (vi)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_vi

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_vi
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_vi
target:
  api_endpoint: {}

mmlu_yo#

Global-MMLU 0-shot CoT in Yoruba (yo)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_yo

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_yo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_yo
target:
  api_endpoint: {}

mmlu_yo-lite#

Global-MMLU-Lite 0-shot CoT in Yoruba (yo)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_yo-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_yo-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_yo-lite
target:
  api_endpoint: {}
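
The command template above is rendered with Jinja2 from the config and target values listed for this task. As a minimal sketch of that rendering, the Python snippet below fills a trimmed excerpt of the template with the mmlu_yo-lite defaults; the model id, endpoint URL, and output directory are placeholder values, not part of this page.

# Minimal sketch: renders a trimmed excerpt of the command template above.
# The full template adds retries, caching, judge flags, and custom-config
# handling. Model id, endpoint URL, and output directory are placeholders.
from jinja2 import Template

template = Template(
    "simple_evals --model {{target.api_endpoint.model_id}}"
    " --eval_name {{config.params.task}}"
    " --url {{target.api_endpoint.url}}"
    " --temperature {{config.params.temperature}}"
    " --top_p {{config.params.top_p}}"
    " --max_tokens {{config.params.max_new_tokens}}"
    " --out_dir {{config.output_dir}}/{{config.type}}"
)

config = {
    "type": "mmlu_yo-lite",
    "output_dir": "/results",  # placeholder
    "params": {
        "task": "mmlu_yo-lite",
        "temperature": 0.0,
        "top_p": 1.0e-05,
        "max_new_tokens": 16384,
    },
}
target = {"api_endpoint": {"model_id": "my-model", "url": "http://localhost:8000/v1"}}

print(template.render(config=config, target=target))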

mmlu_zh-lite#

Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh)

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: mmlu_zh-lite

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_zh-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_zh-lite
target:
  api_endpoint: {}
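
When extra.custom_config is set, the command template first writes the config as JSON to temp_config.json and then converts it to YAML with an inline python3 call before passing it to simple_evals via --custom_eval_cfg_file. The snippet below is a standalone, expanded form of that one-liner; the output directory and the example config contents are placeholders.

# Expanded form of the inline JSON -> YAML conversion used by the command
# template when extra.custom_config is set. Paths and contents are placeholders.
import json
import yaml

output_dir = "/results"                # placeholder for config.output_dir
custom_config = {"example_option": 1}  # placeholder for extra.custom_config

# Step 1: the template echoes the custom config as JSON.
with open(f"{output_dir}/temp_config.json", "w") as f:
    json.dump(custom_config, f)

# Step 2: re-read the JSON and dump it as YAML for --custom_eval_cfg_file.
with open(f"{output_dir}/temp_config.json") as f:
    config_data = json.load(f)
with open(f"{output_dir}/custom_config.yml", "w") as f:
    yaml.dump(config_data, f, default_flow_style=False)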

simpleqa#

A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.

Harness: simple_evals

Container:

nvcr.io/nvidia/eval-factory/simple-evals:26.01

Container Digest:

sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158

Container Arch: multiarch

Task Type: simpleqa

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt  %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: simpleqa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: simpleqa
target:
  api_endpoint: {}
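
SimpleQA scores answers with an LLM judge; the judge block above configures that second endpoint and is left unset by default. The sketch below shows one way a judge call with those defaults (temperature 0.0, top_p 0.0001, max_tokens 1024) might look against an OpenAI-compatible endpoint. The judge URL, model id, example data, and grading prompt are illustrative placeholders, not the harness's actual grader.

# Illustrative judge call mirroring the judge defaults above. URL, model id,
# and grading prompt are placeholders, not the harness's actual grader.
from openai import OpenAI

judge = OpenAI(base_url="http://judge.example.com/v1", api_key="JUDGE_KEY")

question = "What year was the Eiffel Tower completed?"
gold = "1889"
prediction = "It was completed in 1889."

grading_prompt = (
    "Grade the predicted answer against the gold answer.\n"
    f"Question: {question}\nGold answer: {gold}\nPredicted answer: {prediction}\n"
    "Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."
)

reply = judge.chat.completions.create(
    model="judge-model",  # placeholder judge model id
    messages=[{"role": "user", "content": grading_prompt}],
    temperature=0.0,
    top_p=0.0001,
    max_tokens=1024,
)
print(reply.choices[0].message.content)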