# lm-evaluation-harness

This page lists all evaluation tasks available for lm-evaluation-harness.
| Task | Description |
|---|---|
| adlr_agieval_en_cot | Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_arc_challenge_llama_25_shot | ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_commonsense_qa_7_shot | CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_global_mmlu_lite_5_shot | Global-MMLU subset (8 languages: es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_gpqa_diamond_cot_5_shot | Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_gsm8k_cot_8_shot | GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_humaneval_greedy | HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_humaneval_sampled | HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_math_500_4_shot_sampled | MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_mbpp_sanitized_3_shot_greedy | MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR). |
| adlr_mbpp_sanitized_3_shot_sampled | MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | MGSM native CoT subset (6 languages: es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | RACE version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR). |
|  | AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. |
|  | The ARC-Challenge dataset consists of 2,590 multiple-choice science exam questions. |
|  | The multilingual versions of the ARC-Challenge dataset. |
|  | The BIG-Bench Hard (BBH) benchmark is part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. |
|  | BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question-answering systems. It contains ambiguous questions spanning 9 categories: disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint). |
|  | BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question-answering systems. It contains ambiguous questions spanning 9 categories: disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint). |
|  | Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. |
|  | The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. |
|  | The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade-school math word problems. |
|  | The HellaSwag benchmark tests a language model’s commonsense reasoning by having it choose the most logical ending for a given story. |
|  | The multilingual versions of the HellaSwag benchmark. |
|  | IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for strict and rigorous metrics. |
|  | MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint). |
|  | MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint). |
|  | MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: German dataset (chat endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: German dataset (completions endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: Spanish dataset (chat endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: Spanish dataset (completions endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: French dataset (chat endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: French dataset (completions endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: Italian dataset (chat endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: Italian dataset (completions endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: Japanese dataset (chat endpoint). |
|  | A Multilingual Benchmark for Advanced Large Language Model Evaluation: Japanese dataset (completions endpoint). |
|  | MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. |
|  | The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural-language narratives. |
|  | WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options, testing commonsense reasoning. |
## adlr_agieval_en_cot

Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_agieval_en_cot

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_agieval_en_cot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: adlr_agieval_en_cot
target:
  api_endpoint:
    stream: false
```
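To see how the defaults above turn into an lm-eval invocation, here is an illustrative, stdlib-only Python sketch that mirrors the template's flag logic. It is not part of the harness, and the `base_url` and `model_id` values are made-up placeholders.

```python
# Sketch of how the Jinja command template maps default parameters onto
# lm-eval CLI flags. Placeholder endpoint/model values; not harness code.

def build_command(task, temperature, top_p, parallelism=10,
                  request_timeout=30, max_retries=5, num_fewshot=None,
                  base_url="http://localhost:8000/v1/completions",
                  model_id="my-model", output_dir="/results"):
    parts = ["lm-eval", "--tasks", task]
    if num_fewshot is not None:  # emitted only when extra.num_fewshot is defined
        parts += ["--num_fewshot", str(num_fewshot)]
    parts += ["--model", "local-completions"]  # completions endpoint
    model_args = (f"base_url={base_url},model={model_id},"
                  f"tokenized_requests=False,tokenizer_backend=None,"
                  f"num_concurrent={parallelism},timeout={request_timeout},"
                  f"max_retries={max_retries},stream=False")
    parts += ["--model_args", model_args,
              "--log_samples", "--output_path", output_dir,
              "--use_cache", f"{output_dir}/lm_cache"]
    # temperature/top_p are set in all defaults above, so --gen_kwargs appears
    parts.append(f'--gen_kwargs="temperature={temperature},top_p={top_p}"')
    return " ".join(parts)

print(build_command("adlr_agieval_en_cot", temperature=0.0, top_p=1e-05))
```

Chat endpoints would instead select `local-chat-completions` and add `--fewshot_as_multiturn --apply_chat_template`, per the template's `elif` branch.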
## adlr_arc_challenge_llama_25_shot

ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_arc_challenge_llama_25_shot

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_arc_challenge_llama
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 25
  supported_endpoint_types:
    - completions
  type: adlr_arc_challenge_llama_25_shot
target:
  api_endpoint:
    stream: false
```
## adlr_commonsense_qa_7_shot

CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_commonsense_qa_7_shot

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: commonsense_qa
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 7
  supported_endpoint_types:
    - completions
  type: adlr_commonsense_qa_7_shot
target:
  api_endpoint:
    stream: false
```
## adlr_global_mmlu_lite_5_shot

Global-MMLU subset (8 languages: es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_global_mmlu_lite_5_shot

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_global_mmlu
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
    - completions
  type: adlr_global_mmlu_lite_5_shot
target:
  api_endpoint:
    stream: false
```
## adlr_gpqa_diamond_cot_5_shot

Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_gpqa_diamond_cot_5_shot

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_gpqa_diamond_cot_5_shot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
    - completions
  type: adlr_gpqa_diamond_cot_5_shot
target:
  api_endpoint:
    stream: false
```
## adlr_gsm8k_cot_8_shot

GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_gsm8k_cot_8_shot

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_gsm8k_fewshot_cot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 8
  supported_endpoint_types:
    - completions
  type: adlr_gsm8k_cot_8_shot
target:
  api_endpoint:
    stream: false
```
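Several of the greedy-decoding tasks on this page pair `temperature: 0.0` with `top_p: 1.0e-05`. A top-p that small keeps only the single most probable token in the nucleus, so sampling degenerates to greedy decoding even on backends that do not special-case temperature 0. A toy, stdlib-only illustration of that effect (not harness code):

```python
# Toy nucleus (top-p) filtering: with top_p=1e-05 only the single
# highest-probability token survives, which is exactly greedy decoding.

def nucleus(probs, top_p):
    """Return the token ids kept by top-p filtering (probs: id -> probability)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append(token_id)      # the top token is always kept
        cumulative += p
        if cumulative >= top_p:    # stop once the nucleus mass reaches top_p
            break
    return kept

dist = {"the": 0.55, "a": 0.30, "an": 0.15}
print(nucleus(dist, top_p=1e-05))  # only the argmax token survives
print(nucleus(dist, top_p=0.95))   # a broad nucleus, as in the sampled tasks
```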
## adlr_humaneval_greedy

HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_humaneval_greedy

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_greedy
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: adlr_humaneval_greedy
target:
  api_endpoint:
    stream: false
```
## adlr_humaneval_sampled

HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_humaneval_sampled

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: adlr_humaneval_sampled
target:
  api_endpoint:
    stream: false
```
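The sampled HumanEval variant (temperature 0.6, top_p 0.95) draws multiple stochastic completions per problem, which is what makes pass@k estimation possible. For illustration, here is the standard unbiased pass@k estimator from the HumanEval paper in plain Python; this is the conventional formula, not a claim about this harness's exact internals.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled completions, c of which passed.

    pass@k = 1 - C(n - c, k) / C(n, k),
    the probability that a random size-k subset of the n samples
    contains at least one passing completion.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 completions per problem, 3 of which passed the unit tests:
print(pass_at_k(n=10, c=3, k=1))
```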
## adlr_math_500_4_shot_sampled

MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

- Harness: lm-evaluation-harness
- Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
- Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
- Container Arch: multiarch
- Task Type: adlr_math_500_4_shot_sampled

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

Default configuration:

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_math_500_4_shot_sampled
    temperature: 0.7
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
    - completions
  type: adlr_math_500_4_shot_sampled
target:
  api_endpoint:
    stream: false
```
adlr_mbpp_sanitized_3_shot_greedy#
MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_mbpp_sanitized_3_shot_greedy
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_mbpp_sanitized_3_shot_greedy
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 3
supported_endpoint_types:
- completions
type: adlr_mbpp_sanitized_3_shot_greedy
target:
api_endpoint:
stream: false
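For illustration, this is roughly how the command template above renders for this task against a local completions endpoint. The endpoint URL, model name, and output directory are placeholders, and some `--model_args` fields (tokenizer settings) are omitted for brevity; the exact rendering may differ slightly.

```shell
# Hypothetical rendered command for adlr_mbpp_sanitized_3_shot_greedy.
# base_url, model, and ./results are placeholders, not values from this page.
lm-eval --tasks adlr_mbpp_sanitized_3_shot_greedy \
  --num_fewshot 3 \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,num_concurrent=10,timeout=30,max_retries=5" \
  --log_samples \
  --output_path ./results \
  --use_cache ./results/lm_cache \
  --gen_kwargs="temperature=0.0,top_p=1e-05"
```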
adlr_mbpp_sanitized_3_shot_sampled#
MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_mbpp_sanitized_3_shot_sampled
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_mbpp_sanitized_3_shot_sampled
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 3
supported_endpoint_types:
- completions
type: adlr_mbpp_sanitized_3_shot_sampled
target:
api_endpoint:
stream: false
adlr_mgsm_native_cot_8_shot#
MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_mgsm_native_cot_8_shot
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_mgsm_native_cot_8_shot
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 8
supported_endpoint_types:
- completions
type: adlr_mgsm_native_cot_8_shot
target:
api_endpoint:
stream: false
adlr_minerva_math_nemo_4_shot#
Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_minerva_math_nemo_4_shot
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_minerva_math_nemo
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 4
supported_endpoint_types:
- completions
type: adlr_minerva_math_nemo_4_shot
target:
api_endpoint:
stream: false
adlr_mmlu#
MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_mmlu
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
args: --trust_remote_code
supported_endpoint_types:
- completions
type: adlr_mmlu
target:
api_endpoint:
stream: false
adlr_mmlu_pro_5_shot_base#
MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_mmlu_pro_5_shot_base
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_mmlu_pro_5_shot_base
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: adlr_mmlu_pro_5_shot_base
target:
api_endpoint:
stream: false
adlr_race#
RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_race
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_race
temperature: 1.0
request_timeout: 30
top_p: 1.0
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: adlr_race
target:
api_endpoint:
stream: false
adlr_truthfulqa_mc2#
TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_truthfulqa_mc2
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_truthfulqa_mc2
temperature: 1.0
request_timeout: 30
top_p: 1.0
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: adlr_truthfulqa_mc2
target:
api_endpoint:
stream: false
adlr_winogrande_5_shot#
Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: adlr_winogrande_5_shot
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: winogrande
temperature: 1.0
request_timeout: 30
top_p: 1.0
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: adlr_winogrande_5_shot
target:
api_endpoint:
stream: false
agieval#
AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: agieval
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: agieval
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: agieval
target:
api_endpoint:
stream: false
arc_challenge#
The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: arc_challenge
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: arc_challenge
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: arc_challenge
target:
api_endpoint:
stream: false
arc_challenge_chat#
The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. This variant applies a chat template and defaults to zero-shot evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: arc_challenge_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: arc_challenge_chat
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: arc_challenge_chat
target:
api_endpoint:
stream: false
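Because this task supports only chat endpoints, the template renders with `local-chat-completions` and the chat-specific flags. A hedged sketch of the resulting command, with placeholder endpoint URL, model name, and output directory, and tokenizer-related `--model_args` fields omitted:

```shell
# Hypothetical rendered command for arc_challenge_chat against a local
# chat endpoint. base_url, model, and ./results are placeholders.
lm-eval --tasks arc_challenge_chat \
  --num_fewshot 0 \
  --model local-chat-completions \
  --model_args "base_url=http://localhost:8000/v1/chat/completions,model=my-model,num_concurrent=10,timeout=30,max_retries=5" \
  --log_samples \
  --output_path ./results \
  --use_cache ./results/lm_cache \
  --fewshot_as_multiturn --apply_chat_template \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999,max_gen_toks=1024"
```

Note how `max_new_tokens: 1024` in the config surfaces as `max_gen_toks` inside `--gen_kwargs`, and how `--fewshot_as_multiturn --apply_chat_template` appear only for chat-type endpoints.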
arc_multilingual#
The multilingual versions of the ARC challenge dataset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: arc_multilingual
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: arc_multilingual
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: arc_multilingual
target:
api_endpoint:
stream: false
bbh#
The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: bbh
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_bbh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: bbh
target:
  api_endpoint:
    stream: false
bbh_instruct#
The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. - This variant applies the chat template and defaults to zero-shot evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: bbh_instruct
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbh_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - chat
  type: bbh_instruct
target:
  api_endpoint:
    stream: false
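The `supported_endpoint_types` field above determines which branch of the command template fires: `completions` endpoints run through lm-eval's `local-completions` model, while `chat` endpoints use `local-chat-completions` and additionally get `--fewshot_as_multiturn` and `--apply_chat_template`. A minimal sketch of that branch (hypothetical helper name):

```python
def endpoint_flags(endpoint_type):
    """Sketch of the template's endpoint branch: map the target
    endpoint type to lm-eval's model backend and extra flags."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        # Chat endpoints also send few-shot examples as a multi-turn
        # dialogue and apply the model's chat template to each request.
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type!r}")

print(" ".join(endpoint_flags("chat")))
```

This is why `bbh` (completions-only) and `bbh_instruct` (chat-only) exist as separate task types even though both wrap the same underlying benchmark.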
bbq_chat#
BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: bbq_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbq_generate
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - chat
  type: bbq_chat
target:
  api_endpoint:
    stream: false
bbq_completions#
BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: bbq_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbq_generate
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: bbq_completions
target:
  api_endpoint:
    stream: false
commonsense_qa#
CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. - It contains 12,102 questions with one correct answer and four distractor answers.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: commonsense_qa
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: commonsense_qa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 7
  supported_endpoint_types:
    - completions
  type: commonsense_qa
target:
  api_endpoint:
    stream: false
global_mmlu#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - It is designed for efficient evaluation of multilingual models in 15 languages (including English).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu
target:
  api_endpoint:
    stream: false
global_mmlu_ar#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the AR subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_ar
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ar
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_ar
target:
  api_endpoint:
    stream: false
global_mmlu_bn#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the BN subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_bn
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_bn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_bn
target:
  api_endpoint:
    stream: false
global_mmlu_de#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the DE subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_de
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_de
target:
  api_endpoint:
    stream: false
global_mmlu_en#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the EN subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_en
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_en
target:
  api_endpoint:
    stream: false
global_mmlu_es#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ES subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_es
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_es
target:
  api_endpoint:
    stream: false
global_mmlu_fr#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the FR subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_fr
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_fr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_fr
target:
api_endpoint:
stream: false
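To make the template above concrete, here is a minimal sketch of how the Jinja2 command template expands for the `global_mmlu_fr` defaults against a `completions` endpoint. This is not the packaged launcher; the endpoint URL and model ID are hypothetical placeholders, and only the always-present arguments are rendered (optional branches such as `--limit` and `--downsampling_ratio` are omitted because their parameters are unset).

```python
# Sketch: expand the global_mmlu_fr command template by hand,
# using plain string formatting instead of Jinja2.
params = {
    "task": "global_mmlu_fr",
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
    "tokenized_requests": False,
}
endpoint = {
    # "completions" is the only endpoint type this task supports,
    # so the template selects the local-completions model backend.
    "type": "completions",
    "url": "http://localhost:8000/v1/completions",  # hypothetical
    "model_id": "my-model",                          # hypothetical
    "stream": False,
}

model_args = (
    f"base_url={endpoint['url']},model={endpoint['model_id']},"
    f"tokenized_requests={params['tokenized_requests']},"
    f"tokenizer_backend=None,"
    f"num_concurrent={params['parallelism']},"
    f"timeout={params['request_timeout']},"
    f"max_retries={params['max_retries']},"
    f"stream={endpoint['stream']}"
)
cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f"--model_args \"{model_args}\" --log_samples "
    f"--gen_kwargs=\"temperature={params['temperature']},"
    f"top_p={params['top_p']}\""
)
print(cmd)
```

The near-zero `temperature` and near-one `top_p` defaults effectively make generation greedy while still passing through the sampling code path.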
global_mmlu_full#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full
target:
api_endpoint:
stream: false
global_mmlu_full_am#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Amharic (am) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_am
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_am
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_am
target:
api_endpoint:
stream: false
global_mmlu_full_ar#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Arabic (ar) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ar
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ar
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ar
target:
api_endpoint:
stream: false
global_mmlu_full_bn#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Bengali (bn) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_bn
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_bn
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_bn
target:
api_endpoint:
stream: false
global_mmlu_full_cs#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Czech (cs) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_cs
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_cs
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_cs
target:
api_endpoint:
stream: false
global_mmlu_full_de#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the German (de) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_de
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_de
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_de
target:
api_endpoint:
stream: false
global_mmlu_full_el#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Greek (el) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_el
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_el
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_el
target:
api_endpoint:
stream: false
global_mmlu_full_en#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the English (en) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_en
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_en
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_en
target:
api_endpoint:
stream: false
global_mmlu_full_es#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Spanish (es) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_es
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_es
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_es
target:
api_endpoint:
stream: false
global_mmlu_full_fa#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Persian (fa) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_fa
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_fa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_fa
target:
api_endpoint:
stream: false
global_mmlu_full_fil#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Filipino (fil) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_fil
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_fil
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_fil
target:
api_endpoint:
stream: false
global_mmlu_full_fr#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the French (fr) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_fr
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_fr
target:
  api_endpoint:
    stream: false
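The command template above renders to a plain `lm-eval` invocation once the target and config values are substituted. As an illustration only, assuming a hypothetical local OpenAI-compatible completions endpoint at `http://localhost:8000/v1/completions`, a placeholder model id, and `./results` as the output directory (none of these are defaults from the template), the rendered command for `global_mmlu_full_fr` would look roughly like:

```shell
# Hypothetical rendering of the command template for global_mmlu_full_fr.
# Endpoint URL, model id, and output directory are placeholders, not defaults.
lm-eval --tasks global_mmlu_full_fr \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples \
  --output_path ./results \
  --use_cache ./results/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

Because this task only supports `completions` endpoints, the `--fewshot_as_multiturn --apply_chat_template` branch of the template is not rendered.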
global_mmlu_full_ha#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Hausa (ha) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ha
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ha
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ha
target:
  api_endpoint:
    stream: false
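The `--gen_kwargs` portion of the command template only emits the parameters that are actually set (`temperature`, `top_p`, `max_new_tokens`, the last mapped to lm-eval's `max_gen_toks` key). A small Python sketch of that conditional assembly, with illustrative names that are not part of the harness itself:

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Sketch of the template's conditional --gen_kwargs assembly.

    Only parameters that are not None are emitted; max_new_tokens maps
    to lm-eval's max_gen_toks key.
    """
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    # The template omits the flag entirely when every parameter is unset.
    if not parts:
        return ""
    return '--gen_kwargs="' + ",".join(parts) + '"'

# With this task's defaults (temperature 1.0e-07, top_p 0.9999999):
print(build_gen_kwargs(temperature=1e-07, top_p=0.9999999))
# → --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

The near-zero temperature and near-one top_p defaults make sampling effectively greedy while still going through the sampling code path of the endpoint.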
global_mmlu_full_he#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Hebrew (he) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_he
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_he
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_he
target:
  api_endpoint:
    stream: false
global_mmlu_full_hi#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Hindi (hi) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_hi
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_hi
target:
  api_endpoint:
    stream: false
global_mmlu_full_id#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Indonesian (id) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_id
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_id
target:
  api_endpoint:
    stream: false
global_mmlu_full_ig#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Igbo (ig) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ig
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ig
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ig
target:
  api_endpoint:
    stream: false
global_mmlu_full_it#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Italian (it) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_it
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_it
target:
  api_endpoint:
    stream: false
global_mmlu_full_ja#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Japanese (ja) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ja
target:
  api_endpoint:
    stream: false
global_mmlu_full_ko#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Korean (ko) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ko
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ko
target:
  api_endpoint:
    stream: false
global_mmlu_full_ky#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Kyrgyz (ky) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ky
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ky
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ky
target:
  api_endpoint:
    stream: false
global_mmlu_full_lt#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Lithuanian (lt) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_lt
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_lt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_lt
target:
  api_endpoint:
    stream: false
global_mmlu_full_mg#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Malagasy (mg) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_mg
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_mg
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_mg
target:
  api_endpoint:
    stream: false
global_mmlu_full_ms#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Malay (ms) subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ms
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ms
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ms
target:
  api_endpoint:
    stream: false
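As a concrete illustration, suppose the template above is rendered with these defaults against a hypothetical OpenAI-compatible completions endpoint at http://localhost:8000/v1/completions serving a model named my-model (both placeholders, not values from this page). The resulting invocation is roughly:

```
lm-eval --tasks global_mmlu_full_ms \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples \
  --output_path <output_dir> \
  --use_cache <output_dir>/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

Optional pieces (OPENAI_API_KEY, --num_fewshot, --limit, --downsampling_ratio, the chat-endpoint flags) drop out because the corresponding config values are unset for this task.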
global_mmlu_full_ne#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Nepali (ne) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ne
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ne
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ne
target:
  api_endpoint:
    stream: false
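The --gen_kwargs argument in the command template above is emitted only when at least one of temperature, top_p, or max_new_tokens is set, and each present value is joined into a single comma-separated string. A minimal Python sketch of that assembly (illustrative only; the function name and signature are not part of the harness):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's conditional assembly of --gen_kwargs.

    Returns None when no generation knob is set, matching the template's
    behavior of omitting the flag entirely in that case.
    """
    if temperature is None and top_p is None and max_new_tokens is None:
        return None
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval calls this knob max_gen_toks, hence the rename.
        parts.append(f"max_gen_toks={max_new_tokens}")
    return '--gen_kwargs="' + ",".join(parts) + '"'

print(build_gen_kwargs(temperature=1e-07, top_p=0.9999999))
```

With this task's defaults (temperature 1.0e-07, top_p 0.9999999, no max_new_tokens) the rendered flag is --gen_kwargs="temperature=1e-07,top_p=0.9999999".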
global_mmlu_full_nl#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Dutch (nl) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_nl
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_nl
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_nl
target:
  api_endpoint:
    stream: false
global_mmlu_full_ny#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Nyanja/Chichewa (ny) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ny
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ny
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ny
target:
  api_endpoint:
    stream: false
global_mmlu_full_pl#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Polish (pl) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_pl
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_pl
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_pl
target:
  api_endpoint:
    stream: false
global_mmlu_full_pt#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Portuguese (pt) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_pt
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_pt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_pt
target:
  api_endpoint:
    stream: false
global_mmlu_full_ro#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Romanian (ro) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ro
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ro
target:
  api_endpoint:
    stream: false
global_mmlu_full_ru#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Russian (ru) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_ru
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ru
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ru
target:
  api_endpoint:
    stream: false
global_mmlu_full_si#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Sinhala (si) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_si
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_si
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_si
target:
  api_endpoint:
    stream: false
global_mmlu_full_sn#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Shona (sn) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_sn
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sn
target:
  api_endpoint:
    stream: false
global_mmlu_full_so#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Somali (so) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_so
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_so
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_so
target:
  api_endpoint:
    stream: false
global_mmlu_full_sr#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Serbian (sr) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_sr
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sr
target:
  api_endpoint:
    stream: false
global_mmlu_full_sv#
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Swedish (sv) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_sv
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sv
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sv
target:
  api_endpoint:
    stream: false
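For a completions endpoint, the template above selects --model local-completions and threads the config params into --model_args and --gen_kwargs. A minimal Python sketch of that assembly, using plain string formatting rather than the actual Jinja2 renderer; the endpoint URL and model ID are placeholder assumptions, not values from this page:

```python
# Build the lm-eval command line from the config shown above.
params = {
    "task": "global_mmlu_full_sv",
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
}
# Hypothetical target; substitute your own endpoint and model ID.
endpoint = {
    "url": "http://localhost:8000/v1/completions",
    "model_id": "my-model",
    "stream": False,
}

model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    "tokenized_requests=False",
    f"num_concurrent={params['parallelism']}",
    f"timeout={params['request_timeout']}",
    f"max_retries={params['max_retries']}",
    f"stream={endpoint['stream']}",
])
gen_kwargs = f"temperature={params['temperature']},top_p={params['top_p']}"
cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f'--model_args "{model_args}" --gen_kwargs="{gen_kwargs}"'
)
```

The chat variant of the template differs only in using local-chat-completions plus the --fewshot_as_multiturn and --apply_chat_template flags.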
global_mmlu_full_sw
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Swahili (sw) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_sw
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sw
target:
  api_endpoint:
    stream: false
global_mmlu_full_te
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Telugu (te) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_te
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_te
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_te
target:
  api_endpoint:
    stream: false
global_mmlu_full_tr
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Turkish (tr) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_tr
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_tr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_tr
target:
  api_endpoint:
    stream: false
global_mmlu_full_uk
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Ukrainian (uk) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_uk
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_uk
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_uk
target:
  api_endpoint:
    stream: false
global_mmlu_full_vi
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Vietnamese (vi) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_vi
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_vi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_vi
target:
  api_endpoint:
    stream: false
global_mmlu_full_yo
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Yoruba (yo) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_yo
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_yo
target:
  api_endpoint:
    stream: false
global_mmlu_full_zh
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Chinese (zh) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_full_zh
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_zh
target:
  api_endpoint:
    stream: false
global_mmlu_hi
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the Hindi (hi) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_hi
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_hi
target:
  api_endpoint:
    stream: false
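The near-greedy sampling defaults used throughout these configs (temperature 1.0e-07 with top_p 0.9999999) collapse the output distribution onto the highest-probability token, making generation effectively deterministic while still going through the sampling code path. A small sketch of why, assuming simple temperature-scaled softmax sampling:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically
    # stable softmax (subtract the max before exponentiating).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
probs = softmax_with_temperature(logits, 1.0e-07)
# As temperature approaches zero, virtually all probability mass
# lands on the argmax token, so sampling reduces to greedy decoding.
```

With top_p just below 1.0, the nucleus still covers essentially the whole vocabulary, so the temperature alone determines the near-greedy behavior.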
global_mmlu_id
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the Indonesian (id) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_id
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_id
target:
  api_endpoint:
    stream: false
global_mmlu_it
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the Italian (it) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_it
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_it
target:
  api_endpoint:
    stream: false
global_mmlu_ja
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the Japanese (ja) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_ja
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_ja
target:
  api_endpoint:
    stream: false
global_mmlu_ko
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the Korean (ko) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_ko
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_ko
target:
  api_endpoint:
    stream: false
global_mmlu_pt
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the Portuguese (pt) subset.
Harness: lm-evaluation-harness
Container: nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest: sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_pt
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_pt
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_pt
target:
api_endpoint:
stream: false
global_mmlu_sw#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the SW subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_sw
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_sw
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_sw
target:
api_endpoint:
stream: false
global_mmlu_yo#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the YO subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_yo
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_yo
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_yo
target:
api_endpoint:
stream: false
global_mmlu_zh#
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ZH subset.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: global_mmlu_zh
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_zh
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_zh
target:
api_endpoint:
stream: false
gpqa#
The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gpqa
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: leaderboard_gpqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: gpqa
target:
api_endpoint:
stream: false
gpqa_diamond_cot#
The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. - This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gpqa_diamond_cot
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gpqa_diamond_cot_zeroshot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: gpqa_diamond_cot
target:
api_endpoint:
stream: false
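Unlike the base `gpqa` task, this variant sets `max_new_tokens: 1024`, which the template forwards to lm-evaluation-harness under its `max_gen_toks` key inside `--gen_kwargs`. A minimal sketch of that assembly, using the default values above:

```python
# Sketch: the template renames max_new_tokens to lm-eval's max_gen_toks key
# and only emits parameters that are not none.
temperature, top_p, max_new_tokens = 1.0e-07, 0.9999999, 1024

parts = []
if temperature is not None:
    parts.append(f"temperature={temperature}")
if top_p is not None:
    parts.append(f"top_p={top_p}")
if max_new_tokens is not None:
    parts.append(f"max_gen_toks={max_new_tokens}")

gen_kwargs = f'--gen_kwargs="{",".join(parts)}"'
print(gen_kwargs)  # --gen_kwargs="temperature=1e-07,top_p=0.9999999,max_gen_toks=1024"
```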
gsm8k#
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gsm8k
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: gsm8k
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: gsm8k
target:
api_endpoint:
stream: false
gsm8k_cot_instruct#
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gsm8k_cot_instruct
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: gsm8k_zeroshot_cot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --add_instruction
supported_endpoint_types:
- chat
type: gsm8k_cot_instruct
target:
api_endpoint:
stream: false
gsm8k_cot_llama#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought evaluation; implementation taken from Llama.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gsm8k_cot_llama
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gsm8k_cot_llama
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: gsm8k_cot_llama
target:
api_endpoint:
stream: false
gsm8k_cot_zeroshot#
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gsm8k_cot_zeroshot
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gsm8k_cot_zeroshot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: gsm8k_cot_zeroshot
target:
api_endpoint:
stream: false
gsm8k_cot_zeroshot_llama#
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation; implementation taken from Llama.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: gsm8k_cot_zeroshot_llama
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gsm8k_cot_llama
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: gsm8k_cot_zeroshot_llama
target:
api_endpoint:
stream: false
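Note that this variant runs the same underlying task as `gsm8k_cot_llama`; it differs only by pinning `extra.num_fewshot: 0`. The template emits `--num_fewshot` only when `extra.num_fewshot` is defined (Jinja's `is defined` test checks presence, not truthiness, so a value of `0` still produces the flag). A small sketch of that behavior:

```python
# Sketch: --num_fewshot is emitted only when extra.num_fewshot is defined,
# mirroring the "is defined" check in the Jinja template above.
def fewshot_flag(extra: dict) -> str:
    # membership test, not truthiness: num_fewshot=0 must still emit the flag
    return f" --num_fewshot {extra['num_fewshot']}" if "num_fewshot" in extra else ""

print(repr(fewshot_flag({})))                  # gsm8k_cot_llama: ''
print(repr(fewshot_flag({"num_fewshot": 0})))  # this variant: ' --num_fewshot 0'
```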
hellaswag#
The HellaSwag benchmark tests a language model’s commonsense reasoning by having it choose the most logical ending for a given story.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: hellaswag
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: hellaswag
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 10
supported_endpoint_types:
- completions
type: hellaswag
target:
api_endpoint:
stream: false
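The `--model_args` string in the command template above is assembled from `target.api_endpoint` and `config.params`. A minimal sketch of that assembly in plain Python (the URL and model id below are placeholder values, not harness defaults):

```python
# Sketch: build the --model_args string the way the Jinja template does.
# "params" mirrors config.params; "endpoint" mirrors target.api_endpoint.
params = {
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
    "tokenized_requests": False,
    "tokenizer_backend": "None",
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # placeholder
    "model_id": "my-model",                          # placeholder
    "stream": False,
}

def build_model_args(params, endpoint, tokenizer=None):
    parts = [
        f"base_url={endpoint['url']}",
        f"model={endpoint['model_id']}",
        f"tokenized_requests={params['tokenized_requests']}",
    ]
    if tokenizer is not None:
        # tokenizer= is only emitted when config.params.extra.tokenizer is set
        parts.append(f"tokenizer={tokenizer}")
    parts += [
        f"tokenizer_backend={params['tokenizer_backend']}",
        f"num_concurrent={params['parallelism']}",
        f"timeout={params['request_timeout']}",
        f"max_retries={params['max_retries']}",
        f"stream={endpoint['stream']}",
    ]
    return ",".join(parts)

print(build_model_args(params, endpoint))
```

lm-eval parses this comma-separated string itself, which is why the whole value is quoted in the command.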
hellaswag_multilingual#
The multilingual versions of the HellaSwag benchmark.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: hellaswag_multilingual
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: hellaswag_multilingual
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 10
supported_endpoint_types:
- completions
type: hellaswag_multilingual
target:
api_endpoint:
stream: false
humaneval_instruct#
The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. - Implementation taken from Llama.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: humaneval_instruct
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: humaneval_instruct
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: humaneval_instruct
target:
api_endpoint:
stream: false
ifeval#
IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: ifeval
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: ifeval
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: ifeval
target:
api_endpoint:
stream: false
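The endpoint type drives two things in the template: the `--model` backend name and the chat-only flags appended at the end. Sketched as a small helper (an illustration of the branch, not a harness API):

```python
def endpoint_flags(endpoint_type):
    # Mirrors the template's branch: chat endpoints use the chat backend and
    # additionally get the few-shot-as-multiturn and chat-template flags.
    if endpoint_type == "chat":
        return "local-chat-completions", ["--fewshot_as_multiturn", "--apply_chat_template"]
    if endpoint_type == "completions":
        return "local-completions", []
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(endpoint_flags("chat"))
```

Tasks such as `ifeval` list only `chat` under `supported_endpoint_types`, so they always run through the chat branch.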
m_mmlu_id_str_chat#
The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: m_mmlu_id_str_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: m_mmlu_id_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code
supported_endpoint_types:
- chat
type: m_mmlu_id_str_chat
target:
api_endpoint:
stream: false
m_mmlu_id_str_completions#
The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: m_mmlu_id_str_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: m_mmlu_id_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code
supported_endpoint_types:
- completions
type: m_mmlu_id_str_completions
target:
api_endpoint:
stream: false
mbpp_plus_chat#
MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mbpp_plus_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mbpp_plus
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --confirm_run_unsafe_code
supported_endpoint_types:
- chat
type: mbpp_plus_chat
target:
api_endpoint:
stream: false
mbpp_plus_completions#
MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mbpp_plus_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mbpp_plus
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --confirm_run_unsafe_code
supported_endpoint_types:
- completions
type: mbpp_plus_completions
target:
api_endpoint:
stream: false
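Note that the mbpp_plus configs above define no `num_fewshot` under `extra`: the template emits `--num_fewshot` only when the key is present, so these tasks fall back to the task's own default. A sketch of that check (illustrative helper, not part of the harness):

```python
def fewshot_flag(extra):
    # --num_fewshot is emitted only when config.params.extra defines it;
    # otherwise the lm-eval task default applies.
    if "num_fewshot" in extra:
        return ["--num_fewshot", str(extra["num_fewshot"])]
    return []

print(fewshot_flag({"num_fewshot": 10}))
print(fewshot_flag({}))
```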
mgsm#
The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mgsm
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mgsm_direct
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mgsm
target:
api_endpoint:
stream: false
mgsm_cot_chat#
The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the chat endpoint and defaults to chain-of-thought evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mgsm_cot_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: mgsm_cot_native
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: mgsm_cot_chat
target:
api_endpoint:
stream: false
mgsm_cot_completions#
The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the completions endpoint and defaults to chain-of-thought evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mgsm_cot_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: mgsm_cot_native
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- completions
type: mgsm_cot_completions
target:
api_endpoint:
stream: false
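Generation settings (`temperature`, `top_p`, `max_new_tokens`) are folded into a single `--gen_kwargs` value; note that `max_new_tokens` maps to lm-eval's `max_gen_toks` key. A sketch of that mapping, using the values from the mgsm_cot config above:

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    # Returns the --gen_kwargs value, or None when no generation
    # parameter is set (in which case the flag is omitted entirely).
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    return ",".join(parts) if parts else None

print(build_gen_kwargs(1e-07, 0.9999999, 1024))
```

The near-zero `temperature` and near-one `top_p` defaults approximate greedy decoding through sampling-only APIs.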
mmlu#
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses text generation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
args: --trust_remote_code
supported_endpoint_types:
- completions
type: mmlu
target:
api_endpoint:
stream: false
mmlu_cot_0_shot_chat#
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant defaults to chain-of-thought zero-shot evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_cot_0_shot_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_cot_0_shot_chat
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- chat
type: mmlu_cot_0_shot_chat
target:
api_endpoint:
stream: false
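In the command template above, `--gen_kwargs` is emitted only when at least one sampling parameter is set, and its value is a comma-joined list of the parameters that are present (`max_new_tokens` is passed to lm-eval as `max_gen_toks`). A minimal Python sketch of that conditional assembly (the function name is illustrative, not part of the harness):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's intent: emit nothing when no sampling
    parameter is set, otherwise join the present ones into one string."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this parameter max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

# Defaults used by the mmlu_* tasks on this page (near-greedy decoding)
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999))
# --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

The near-zero `temperature` and near-one `top_p` defaults make generation effectively greedy while still going through the sampling code path.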
mmlu_instruct#
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the chat endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_instruct
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code --add_instruction
supported_endpoint_types:
- chat
type: mmlu_instruct
target:
api_endpoint:
stream: false
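The template also selects the lm-eval model backend from the endpoint type (`completions` maps to `local-completions`, `chat` to `local-chat-completions`) and appends the chat-only flags `--fewshot_as_multiturn --apply_chat_template` for chat endpoints. A hypothetical helper sketching that selection:

```python
def select_backend(endpoint_type):
    """Map an Eval Factory endpoint type to the lm-eval --model backend,
    plus the flags that apply only to that endpoint type."""
    backends = {
        "completions": "local-completions",
        "chat": "local-chat-completions",
    }
    if endpoint_type not in backends:
        raise ValueError(f"unsupported endpoint type: {endpoint_type}")
    # Chat endpoints get few-shot examples as multi-turn messages and
    # have the model's chat template applied to each request.
    flags = (["--fewshot_as_multiturn", "--apply_chat_template"]
             if endpoint_type == "chat" else [])
    return backends[endpoint_type], flags

print(select_backend("chat"))
```

Each task entry lists its `supported_endpoint_types`, so only one branch of this mapping is reachable for a given task.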
mmlu_instruct_completions#
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the completions endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_instruct_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code --add_instruction
supported_endpoint_types:
- completions
type: mmlu_instruct_completions
target:
api_endpoint:
stream: false
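Similarly, `--model_args` is a single comma-separated string assembled from the target and config fields (`base_url`, `model`, `num_concurrent`, `timeout`, and so on), with the `tokenizer` entry included only when one is configured. A simplified sketch of that assembly, with illustrative values standing in for the template fields:

```python
def _fmt(value):
    # Render booleans lowercase, matching the true/false spelling in the config
    return str(value).lower() if isinstance(value, bool) else str(value)

def build_model_args(base_url, model_id, parallelism, timeout, max_retries,
                     tokenized_requests=False, stream=False,
                     tokenizer=None, tokenizer_backend="None"):
    """Join endpoint settings into the comma-separated --model_args value;
    the tokenizer entry appears only when a tokenizer is configured."""
    parts = [f"base_url={base_url}",
             f"model={model_id}",
             f"tokenized_requests={_fmt(tokenized_requests)}"]
    if tokenizer is not None:
        parts.append(f"tokenizer={tokenizer}")
    parts += [f"tokenizer_backend={tokenizer_backend}",
              f"num_concurrent={parallelism}",
              f"timeout={timeout}",
              f"max_retries={max_retries}",
              f"stream={_fmt(stream)}"]
    return ",".join(parts)

# The URL and model name here are placeholders, not values from this page
print(build_model_args("http://localhost:8000/v1/completions", "my-model",
                       parallelism=10, timeout=30, max_retries=5))
```

Note that `config.params.parallelism` becomes lm-eval's `num_concurrent`, and `request_timeout` becomes `timeout`.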
mmlu_logits#
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant scores accuracy from the model's logits rather than from generated text.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_logits
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: mmlu_logits
target:
api_endpoint:
stream: false
mmlu_pro#
MMLU-Pro is a refined version of the MMLU dataset with 10 answer choices per question instead of 4 (completions endpoint).
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_pro
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_pro
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: mmlu_pro
target:
api_endpoint:
stream: false
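When `downsampling_ratio` is set, the optional `--downsampling_ratio` flag evaluates only a fraction of the dataset. The harness's exact selection scheme is not shown here; its effect can be sketched as seeded random subsampling so that repeated runs evaluate the same subset (an assumption for illustration):

```python
import random

def downsample(samples, ratio, seed=1234):
    """Keep roughly `ratio` of the samples, chosen with a fixed seed so
    repeated runs see the same subset. Illustrative only; the harness's
    own selection scheme may differ."""
    k = max(1, int(len(samples) * ratio))
    rng = random.Random(seed)
    return rng.sample(samples, k)

subset = downsample(list(range(1000)), 0.1)
print(len(subset))  # 100
```

Downsampled scores trade precision for runtime, so they are best used for smoke tests rather than reported results.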
mmlu_pro_instruct#
MMLU-Pro is a refined version of the MMLU dataset with 10 answer choices per question instead of 4. This variant applies a chat template and defaults to zero-shot evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_pro_instruct
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: mmlu_pro
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: mmlu_pro_instruct
target:
api_endpoint:
stream: false
mmlu_prox_chat#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_chat
target:
api_endpoint:
stream: false
mmlu_prox_completions#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_completions
target:
api_endpoint:
stream: false
mmlu_prox_de_chat#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (chat endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_de_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_de
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_de_chat
target:
api_endpoint:
stream: false
mmlu_prox_de_completions#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (completions endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_de_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_de
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_de_completions
target:
api_endpoint:
stream: false
mmlu_prox_es_chat#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (chat endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_es_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_es
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_es_chat
target:
api_endpoint:
stream: false
mmlu_prox_es_completions#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (completions endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_es_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_es
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_es_completions
target:
api_endpoint:
stream: false
mmlu_prox_fr_chat#
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (chat endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_fr_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_fr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_fr_chat
target:
api_endpoint:
stream: false
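The `--gen_kwargs` portion of the command template above is assembled conditionally: a parameter is emitted only when it is set, and `max_new_tokens` is passed to lm-eval under the name `max_gen_toks`. A minimal Python sketch of that logic (the function name and signature are illustrative, not part of lm-eval itself):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's --gen_kwargs assembly: include only the
    parameters that are not None; return None when nothing is set."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this generation limit max_gen_toks.
        parts.append(f"max_gen_toks={max_new_tokens}")
    return ",".join(parts) if parts else None

# With this task's defaults (temperature: 1.0e-07, top_p: 0.9999999):
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999))
# temperature=1e-07,top_p=0.9999999
```

The near-zero temperature and near-one top_p in these configs make sampling effectively greedy while keeping both parameters explicitly set.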
mmlu_prox_fr_completions#
A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (completions endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_fr_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_fr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_fr_completions
target:
api_endpoint:
stream: false
mmlu_prox_it_chat#
A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (chat endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_it_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_it
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_it_chat
target:
api_endpoint:
stream: false
mmlu_prox_it_completions#
A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (completions endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_it_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_it
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_it_completions
target:
api_endpoint:
stream: false
mmlu_prox_ja_chat#
A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (chat endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_ja_chat
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_ja
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_ja_chat
target:
api_endpoint:
stream: false
mmlu_prox_ja_completions#
A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (completions endpoint)
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_prox_ja_completions
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_ja
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_ja_completions
target:
api_endpoint:
stream: false
mmlu_redux#
MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_redux
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_redux
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_redux
target:
api_endpoint:
stream: false
mmlu_redux_instruct#
MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. This variant applies a chat template and defaults to zero-shot evaluation.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: mmlu_redux_instruct
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 8192
max_retries: 5
parallelism: 10
task: mmlu_redux
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --add_instruction
supported_endpoint_types:
- chat
type: mmlu_redux_instruct
target:
api_endpoint:
stream: false
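The instruct variant above differs from the base `mmlu_redux` task only in its parameters: `num_fewshot: 0`, extra `--add_instruction` args, and a chat endpoint, which the template maps to `local-chat-completions` plus the `--fewshot_as_multiturn --apply_chat_template` flags. A rough sketch of that mapping, under the assumption that these are the only endpoint-dependent flags (helper name is hypothetical):

```python
def endpoint_flags(endpoint_type, num_fewshot=None, extra_args=None):
    """Sketch how the command template maps endpoint type, few-shot
    count, and extra args onto lm-eval CLI flags."""
    flags = []
    if num_fewshot is not None:
        flags += ["--num_fewshot", str(num_fewshot)]
    flags += ["--model",
              "local-chat-completions" if endpoint_type == "chat"
              else "local-completions"]
    if endpoint_type == "chat":
        # Chat endpoints send few-shot examples as multi-turn messages
        # and apply the model's chat template.
        flags += ["--fewshot_as_multiturn", "--apply_chat_template"]
    if extra_args:
        flags += extra_args.split()
    return flags

print(endpoint_flags("chat", num_fewshot=0, extra_args="--add_instruction"))
```

For `mmlu_redux_instruct` this yields the zero-shot chat-template invocation; the base `mmlu_redux` task, running against a completions endpoint with no few-shot override, gets only `--model local-completions`.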
musr#
The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: musr
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: leaderboard_musr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: musr
target:
api_endpoint:
stream: false
openbookqa#
OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject. Answering OpenBookQA questions requires additional broad common knowledge not contained in the book. The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: openbookqa
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: openbookqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: openbookqa
target:
api_endpoint:
stream: false
piqa#
Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: piqa
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: piqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: piqa
target:
api_endpoint:
stream: false
truthfulqa#
The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions. It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: truthfulqa
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: truthfulqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: truthfulqa
target:
api_endpoint:
stream: false
wikilingua#
The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: wikilingua
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: wikilingua
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- chat
type: wikilingua
target:
api_endpoint:
stream: false
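The command template above branches on the endpoint type: chat endpoints get `local-chat-completions` plus `--fewshot_as_multiturn --apply_chat_template`, while completions endpoints get `local-completions` without those flags. A minimal Python sketch of that resolution logic (the `base_url` and `model=my-model` values are hypothetical placeholders, not part of the config):

```python
# Sketch of how the Jinja template above resolves for each endpoint type.
# base_url and model name below are hypothetical placeholders.
def render_command(endpoint_type, task, parallelism, timeout, max_retries):
    model = "local-chat-completions" if endpoint_type == "chat" else "local-completions"
    parts = [
        "lm-eval", "--tasks", task,
        "--model", model,
        "--model_args",
        f"base_url=http://localhost:8000/v1,model=my-model,"
        f"num_concurrent={parallelism},timeout={timeout},"
        f"max_retries={max_retries},stream=False",
    ]
    # Only chat endpoints apply the chat template and multiturn few-shot.
    if endpoint_type == "chat":
        parts += ["--fewshot_as_multiturn", "--apply_chat_template"]
    return " ".join(parts)

print(render_command("chat", "wikilingua", parallelism=10,
                     timeout=30, max_retries=5))
```

Since wikilingua only supports chat endpoints, the rendered command always carries the chat-specific flags.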
wikitext#
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. This task measures perplexity on the WikiText-2 dataset via rolling log-likelihoods.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: wikitext
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: wikitext
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- completions
type: wikitext
target:
api_endpoint:
stream: false
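Perplexity from rolling log-likelihoods reduces to a simple aggregation: the exponential of the negative mean per-token log-likelihood over the corpus. A minimal sketch, with a simplified chunking helper (the real harness uses overlapping rolling windows so each token keeps as much left context as possible; disjoint chunks are shown here for brevity):

```python
import math

def chunk_spans(n_tokens, max_len):
    """Split a long document into spans no longer than the model context.
    Simplification: lm-eval's rolling windows overlap to preserve left
    context; this version just uses disjoint chunks."""
    return [(i, min(i + max_len, n_tokens)) for i in range(0, n_tokens, max_len)]

def perplexity(token_loglikelihoods):
    """Corpus perplexity: exp of the negative mean per-token log-likelihood."""
    n = len(token_loglikelihoods)
    return math.exp(-sum(token_loglikelihoods) / n)

# A model assigning probability 1/2 to every token has perplexity 2.
print(perplexity([-math.log(2)] * 100))
```

Note that wikitext only supports completions endpoints, since per-token log-likelihoods are not exposed by chat APIs.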
winogrande#
WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options, testing commonsense reasoning.
Harness: lm-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
Container Digest:
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
Container Arch: multiarch
Task Type: winogrande
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: winogrande
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: winogrande
target:
api_endpoint:
stream: false
social_iqa#
Social IQa contains 38,000 multiple-choice questions probing emotional and social intelligence in a variety of everyday situations.
Harness: lm-evaluation-harness
Container:
Container Digest:
Container Arch: multiarch
Task Type: social_iqa