lm-evaluation-harness

This page lists all evaluation tasks available in the lm-evaluation-harness.

Task

Description

adlr_agieval_en_cot

Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_arc_challenge_llama_25_shot

ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_commonsense_qa_7_shot

CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_global_mmlu_lite_5_shot

Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_gpqa_diamond_cot_5_shot

Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_gsm8k_cot_8_shot

GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_humaneval_greedy

HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_humaneval_sampled

HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_math_500_4_shot_sampled

MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mbpp_sanitized_3_shot_greedy

MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mbpp_sanitized_3_shot_sampled

MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mgsm_native_cot_8_shot

MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_minerva_math_nemo_4_shot

Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mmlu

MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mmlu_pro_5_shot_base

MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_race

RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_truthfulqa_mc2

TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_winogrande_5_shot

Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).

agieval

AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models

arc_challenge

The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.

arc_challenge_chat

  • The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. - This variant applies a chat template and defaults to zero-shot evaluation.

arc_multilingual

The multilingual versions of the ARC challenge dataset.

bbh

The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.

bbh_instruct

  • The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. - This variant applies a chat template and defaults to zero-shot evaluation.

bbq_chat

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).

bbq_completions

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint).

commonsense_qa

  • CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. - It contains 12,102 questions with one correct answer and four distractor answers.

global_mmlu

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - It is designed for efficient evaluation of multilingual models in 15 languages (including English).

global_mmlu_ar

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the AR subset.

global_mmlu_bn

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the BN subset.

global_mmlu_de

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the DE subset.

global_mmlu_en

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the EN subset.

global_mmlu_es

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ES subset.

global_mmlu_fr

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the FR subset.

global_mmlu_full

Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.

global_mmlu_full_am

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AM subset.

global_mmlu_full_ar

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AR subset.

global_mmlu_full_bn

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the BN subset.

global_mmlu_full_cs

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the CS subset.

global_mmlu_full_de

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the DE subset.

global_mmlu_full_el

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EL subset.

global_mmlu_full_en

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EN subset.

global_mmlu_full_es

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ES subset.

global_mmlu_full_fa

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FA subset.

global_mmlu_full_fil

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FIL subset.

global_mmlu_full_fr

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FR subset.

global_mmlu_full_ha

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HA subset.

global_mmlu_full_he

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HE subset.

global_mmlu_full_hi

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HI subset.

global_mmlu_full_id

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ID subset.

global_mmlu_full_ig

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IG subset.

global_mmlu_full_it

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IT subset.

global_mmlu_full_ja

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the JA subset.

global_mmlu_full_ko

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KO subset.

global_mmlu_full_ky

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KY subset.

global_mmlu_full_lt

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the LT subset.

global_mmlu_full_mg

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MG subset.

global_mmlu_full_ms

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MS subset.

global_mmlu_full_ne

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NE subset.

global_mmlu_full_nl

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NL subset.

global_mmlu_full_ny

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NY subset.

global_mmlu_full_pl

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PL subset.

global_mmlu_full_pt

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PT subset.

global_mmlu_full_ro

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RO subset.

global_mmlu_full_ru

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RU subset.

global_mmlu_full_si

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SI subset.

global_mmlu_full_sn

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SN subset.

global_mmlu_full_so

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SO subset.

global_mmlu_full_sr

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SR subset.

global_mmlu_full_sv

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SV subset.

global_mmlu_full_sw

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SW subset.

global_mmlu_full_te

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TE subset.

global_mmlu_full_tr

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TR subset.

global_mmlu_full_uk

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the UK subset.

global_mmlu_full_vi

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the VI subset.

global_mmlu_full_yo

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the YO subset.

global_mmlu_full_zh

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ZH subset.

global_mmlu_hi

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the HI subset.

global_mmlu_id

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ID subset.

global_mmlu_it

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the IT subset.

global_mmlu_ja

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the JA subset.

global_mmlu_ko

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the KO subset.

global_mmlu_pt

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the PT subset.

global_mmlu_sw

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the SW subset.

global_mmlu_yo

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the YO subset.

global_mmlu_zh

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ZH subset.

gpqa

The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.

gpqa_diamond_cot

  • The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. - This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.

gsm8k

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.

gsm8k_cot_instruct

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.

gsm8k_cot_llama

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought evaluation - implementation taken from Llama.

gsm8k_cot_zeroshot

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation.

gsm8k_cot_zeroshot_llama

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation - implementation taken from Llama.

hellaswag

The HellaSwag benchmark tests a language model’s commonsense reasoning by having it choose the most logical ending for a given story.

hellaswag_multilingual

The multilingual versions of the HellaSwag benchmark.

humaneval_instruct

  • The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. - Implementation taken from Llama.

ifeval

IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.

m_mmlu_id_str_chat

  • The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint).

m_mmlu_id_str_completions

  • The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint).

mbpp_plus_chat

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).

mbpp_plus_completions

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).

mgsm

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.

mgsm_cot_chat

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the chat endpoint and defaults to chain-of-thought evaluation.

mgsm_cot_completions

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the completions endpoint and defaults to chain-of-thought evaluation.

mmlu

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses text generation.

mmlu_cot_0_shot_chat

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant defaults to chain-of-thought zero-shot evaluation.

mmlu_instruct

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the chat endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response.

mmlu_instruct_completions

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the completions endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response.

mmlu_logits

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the logits of the model to evaluate the accuracy.

mmlu_pro

MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint).

mmlu_pro_instruct

  • MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. - This variant applies a chat template and defaults to zero-shot evaluation.

mmlu_prox_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint)

mmlu_prox_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint)

mmlu_prox_de_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (chat endpoint)

mmlu_prox_de_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (completions endpoint)

mmlu_prox_es_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (chat endpoint)

mmlu_prox_es_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (completions endpoint)

mmlu_prox_fr_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (chat endpoint)

mmlu_prox_fr_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (completions endpoint)

mmlu_prox_it_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (chat endpoint)

mmlu_prox_it_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (completions endpoint)

mmlu_prox_ja_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (chat endpoint)

mmlu_prox_ja_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (completions endpoint)

mmlu_redux

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.

mmlu_redux_instruct

  • MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. - This variant applies a chat template and defaults to zero-shot evaluation.

musr

The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.

openbookqa

  • OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject. - Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book. - The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

piqa

  • Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.

social_iqa

  • Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.

truthfulqa

  • The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions. - It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.

wikilingua

  • The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.

wikitext

  • The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. - This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods.

winogrande

WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options, testing commonsense reasoning.

adlr_agieval_en_cot

Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_agieval_en_cot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_agieval_en_cot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_agieval_en_cot
target:
  api_endpoint:
    stream: false
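In the command template above, the `--gen_kwargs` flag is only emitted when at least one sampling parameter (`temperature`, `top_p`, or `max_new_tokens`) is set, and `max_new_tokens` is mapped to lm-eval's `max_gen_toks` key. A simplified Python sketch of that conditional assembly (the `build_gen_kwargs` helper name is ours, not part of the harness, and unlike the raw template it never emits a leading comma):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Assemble the --gen_kwargs flag, omitting it when no parameter is set,
    as the Jinja template's outer 'is not none' check does."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval calls this generation limit 'max_gen_toks'
        parts.append(f"max_gen_toks={max_new_tokens}")
    if not parts:
        return ""  # flag omitted entirely
    return '--gen_kwargs="' + ",".join(parts) + '"'

# With this task's defaults (temperature 0.0, top_p 1.0e-05):
print(build_gen_kwargs(temperature=0.0, top_p=1e-05))
# → --gen_kwargs="temperature=0.0,top_p=1e-05"
```

With all three parameters unset, the helper returns an empty string and the flag is dropped from the rendered command, matching the template's behavior.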

adlr_arc_challenge_llama_25_shot

ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_arc_challenge_llama_25_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_arc_challenge_llama
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 25
  supported_endpoint_types:
  - completions
  type: adlr_arc_challenge_llama_25_shot
target:
  api_endpoint:
    stream: false
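Note that the template guards `--num_fewshot` with a Jinja `is defined` test rather than `is not none`: the flag appears only when `config.params.extra` actually contains a `num_fewshot` key, as this task's config does (`num_fewshot: 25`). A minimal sketch of that behavior (the `fewshot_flag` helper is illustrative, not part of the harness):

```python
def fewshot_flag(extra: dict) -> str:
    """Emit --num_fewshot only when the key is present in the 'extra' config,
    mirroring the template's 'is defined' check."""
    if "num_fewshot" in extra:
        return f"--num_fewshot {extra['num_fewshot']}"
    return ""  # key absent: the task's own default shot count applies

print(fewshot_flag({"num_fewshot": 25}))  # → --num_fewshot 25
print(fewshot_flag({}))                   # → (empty; flag omitted)
```

Tasks whose config omits the key (such as adlr_agieval_en_cot above) therefore run with the shot count baked into the task definition itself.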

adlr_commonsense_qa_7_shot

CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_commonsense_qa_7_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: commonsense_qa
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 7
  supported_endpoint_types:
  - completions
  type: adlr_commonsense_qa_7_shot
target:
  api_endpoint:
    stream: false

adlr_global_mmlu_lite_5_shot

Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_global_mmlu_lite_5_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_global_mmlu
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_global_mmlu_lite_5_shot
target:
  api_endpoint:
    stream: false

adlr_gpqa_diamond_cot_5_shot#

Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_gpqa_diamond_cot_5_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_gpqa_diamond_cot_5_shot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_gpqa_diamond_cot_5_shot
target:
  api_endpoint:
    stream: false
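
The Jinja template above only emits a `--gen_kwargs` flag when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and maps `max_new_tokens` to lm-eval's `max_gen_toks` key. As a minimal sketch of that flag-building logic (the helper name is ours, not part of the harness):

```python
# Illustrative reproduction of the --gen_kwargs branch of the Jinja template.
# Mirrors the template literally, including its conditional concatenation.
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Return the --gen_kwargs flag string, or '' when no sampling params are set."""
    if temperature is None and top_p is None and max_new_tokens is None:
        return ""
    parts = ""
    if temperature is not None:
        parts += f"temperature={temperature}"
    if top_p is not None:
        parts += f",top_p={top_p}"
    if max_new_tokens is not None:
        parts += f",max_gen_toks={max_new_tokens}"
    return f'--gen_kwargs="{parts}"'

# With the adlr_gpqa_diamond_cot_5_shot defaults (temperature 0.0, top_p 1e-05):
print(build_gen_kwargs(temperature=0.0, top_p=1e-05))
```

Note that, exactly as in the template, a flag built with `top_p` but no `temperature` starts with a leading comma inside the quotes, since the `,` prefixes are attached to the later parameters.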

adlr_gsm8k_cot_8_shot#

GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_gsm8k_cot_8_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_gsm8k_fewshot_cot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 8
  supported_endpoint_types:
  - completions
  type: adlr_gsm8k_cot_8_shot
target:
  api_endpoint:
    stream: false

adlr_humaneval_greedy#

HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_humaneval_greedy

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_greedy
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_humaneval_greedy
target:
  api_endpoint:
    stream: false

adlr_humaneval_sampled#

HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_humaneval_sampled

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_humaneval_sampled
target:
  api_endpoint:
    stream: false

adlr_math_500_4_shot_sampled#

MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_math_500_4_shot_sampled

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_math_500_4_shot_sampled
    temperature: 0.7
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
  - completions
  type: adlr_math_500_4_shot_sampled
target:
  api_endpoint:
    stream: false

adlr_mbpp_sanitized_3_shot_greedy#

MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mbpp_sanitized_3_shot_greedy

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mbpp_sanitized_3_shot_greedy
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 3
  supported_endpoint_types:
  - completions
  type: adlr_mbpp_sanitized_3_shot_greedy
target:
  api_endpoint:
    stream: false

adlr_mbpp_sanitized_3_shot_sampled#

MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mbpp_sanitized_3_shot_sampled

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mbpp_sanitized_3shot_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 3
  supported_endpoint_types:
  - completions
  type: adlr_mbpp_sanitized_3_shot_sampled
target:
  api_endpoint:
    stream: false

adlr_mgsm_native_cot_8_shot#

MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mgsm_native_cot_8_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mgsm_native_cot_8_shot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 8
  supported_endpoint_types:
  - completions
  type: adlr_mgsm_native_cot_8_shot
target:
  api_endpoint:
    stream: false

adlr_minerva_math_nemo_4_shot#

Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_minerva_math_nemo_4_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_minerva_math_nemo
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
  - completions
  type: adlr_minerva_math_nemo_4_shot
target:
  api_endpoint:
    stream: false

adlr_mmlu#

MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mmlu

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: adlr_mmlu
target:
  api_endpoint:
    stream: false
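
Across all of these entries, the template selects the lm-eval model backend from the endpoint type: `completions` endpoints use `local-completions`, while `chat` endpoints use `local-chat-completions` and additionally get `--fewshot_as_multiturn --apply_chat_template`. A rough sketch of that dispatch (the function name is ours, for illustration only):

```python
# Illustrative sketch of the endpoint-type dispatch in the command template.
# Only these two endpoint types are handled by the template; the ADLR tasks
# above all declare supported_endpoint_types: [completions].
def endpoint_flags(endpoint_type):
    """Map an api_endpoint.type to the lm-eval flags the template emits."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(endpoint_flags("completions"))
```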

adlr_mmlu_pro_5_shot_base#

MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mmlu_pro_5_shot_base

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mmlu_pro_5_shot_base
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_mmlu_pro_5_shot_base
target:
  api_endpoint:
    stream: false

adlr_race#

RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_race

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_race
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_race
target:
  api_endpoint:
    stream: false
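
In the command template above, the `--gen_kwargs` flag is assembled conditionally: it is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and lm-eval names the token limit `max_gen_toks`. A minimal Python sketch of that flag-assembly logic (the function name is illustrative, not part of the harness):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's conditional: emit --gen_kwargs only when at
    least one generation parameter is set; lm-eval calls the output
    token limit max_gen_toks."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    if not parts:
        return ""  # flag omitted entirely when nothing is configured
    return '--gen_kwargs="' + ",".join(parts) + '"'
```

With the adlr_race defaults (temperature 1.0, top_p 1.0, no max_new_tokens) this yields `--gen_kwargs="temperature=1.0,top_p=1.0"`.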

adlr_truthfulqa_mc2#

TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_truthfulqa_mc2

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_truthfulqa_mc2
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_truthfulqa_mc2
target:
  api_endpoint:
    stream: false

adlr_winogrande_5_shot#

Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_winogrande_5_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: winogrande
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_winogrande_5_shot
target:
  api_endpoint:
    stream: false
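
The `--model_args` value in the template is a comma-separated string built from the endpoint target and the `extra` parameters, with the `tokenizer=` entry included only when a tokenizer override is configured. A hedged sketch of that assembly, with defaults taken from the configs on this page (helper name illustrative):

```python
def build_model_args(base_url, model_id, *, tokenized_requests=False,
                     tokenizer=None, tokenizer_backend="None",
                     num_concurrent=10, timeout=30, max_retries=5,
                     stream=False):
    """Assemble the comma-separated --model_args value, skipping the
    tokenizer= entry when no tokenizer override is given."""
    parts = [
        f"base_url={base_url}",
        f"model={model_id}",
        f"tokenized_requests={tokenized_requests}",
    ]
    if tokenizer is not None:
        parts.append(f"tokenizer={tokenizer}")
    parts += [
        f"tokenizer_backend={tokenizer_backend}",
        f"num_concurrent={num_concurrent}",
        f"timeout={timeout}",
        f"max_retries={max_retries}",
        f"stream={stream}",
    ]
    return ",".join(parts)
```

For example, `build_model_args("http://localhost:8000/v1", "my-model")` (a hypothetical local endpoint) produces a string starting with `base_url=http://localhost:8000/v1,model=my-model` and containing no `tokenizer=` entry.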

agieval#

AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: agieval

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: agieval
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: agieval
target:
  api_endpoint:
    stream: false

arc_challenge#

The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: arc_challenge

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: arc_challenge
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: arc_challenge
target:
  api_endpoint:
    stream: false

arc_challenge_chat#

The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: arc_challenge_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: arc_challenge_chat
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: arc_challenge_chat
target:
  api_endpoint:
    stream: false
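
As the template shows, the endpoint type selects both the lm-eval model backend and the chat-only flags (`--fewshot_as_multiturn --apply_chat_template`). A minimal sketch of that branch (function name illustrative):

```python
def endpoint_flags(endpoint_type):
    """Map the target endpoint type to the lm-eval backend name and
    the extra CLI flags added only for chat endpoints."""
    if endpoint_type == "completions":
        return "local-completions", []
    if endpoint_type == "chat":
        return "local-chat-completions", [
            "--fewshot_as_multiturn",
            "--apply_chat_template",
        ]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```

This is why a chat-only task such as arc_challenge_chat lists only `chat` under `supported_endpoint_types`: its prompts assume the chat template is applied.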

arc_multilingual#

The multilingual versions of the ARC challenge dataset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: arc_multilingual

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: arc_multilingual
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: arc_multilingual
target:
  api_endpoint:
    stream: false

bbh#

The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbh

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_bbh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: bbh
target:
  api_endpoint:
    stream: false

bbh_instruct#

The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbh_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbh_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: bbh_instruct
target:
  api_endpoint:
    stream: false

bbq_chat#

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbq_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbq_generate
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: bbq_chat
target:
  api_endpoint:
    stream: false

bbq_completions#

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbq_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbq_generate
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: bbq_completions
target:
  api_endpoint:
    stream: false

commonsense_qa#

CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: commonsense_qa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: commonsense_qa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 7
  supported_endpoint_types:
  - completions
  type: commonsense_qa
target:
  api_endpoint:
    stream: false

global_mmlu#

Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. It is designed for efficient evaluation of multilingual models in 15 languages (including English).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu
target:
  api_endpoint:
    stream: false
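
The command template above folds the endpoint settings into a single `--model_args` string for lm-eval. As a rough illustration, the sketch below assembles that string from the config values shown; the `base_url` and `model_id` values are placeholders, not part of this task definition:

```python
# Sketch: assemble the --model_args string the template produces for a
# "completions" endpoint. URL and model ID below are assumptions.
params = {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 30,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # placeholder
    "model_id": "my-model",                         # placeholder
    "stream": False,
}

model_args = (
    f"base_url={endpoint['url']},"
    f"model={endpoint['model_id']},"
    f"tokenized_requests=False,"
    f"tokenizer_backend=None,"
    f"num_concurrent={params['parallelism']},"
    f"timeout={params['request_timeout']},"
    f"max_retries={params['max_retries']},"
    f"stream={endpoint['stream']}"
)
print(model_args)
```

Note that `num_concurrent` comes from `config.params.parallelism` and `timeout` from `config.params.request_timeout`, so tuning either requires changing the config, not the rendered command.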

global_mmlu_ar#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the AR subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_ar

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ar
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ar
target:
  api_endpoint:
    stream: false

global_mmlu_bn#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the BN subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_bn

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_bn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_bn
target:
  api_endpoint:
    stream: false

global_mmlu_de#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the DE subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_de

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_de
target:
  api_endpoint:
    stream: false

global_mmlu_en#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the EN subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_en

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_en
target:
  api_endpoint:
    stream: false

global_mmlu_es#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ES subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_es

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_es
target:
  api_endpoint:
    stream: false

global_mmlu_fr#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the FR subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_fr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_fr
target:
  api_endpoint:
    stream: false

global_mmlu_full#

Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full
target:
  api_endpoint:
    stream: false
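
Per the template, `--gen_kwargs` is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and `max_new_tokens` is passed through as lm-eval's `max_gen_toks`. The following cleaned-up sketch (a simplification of the Jinja logic, not code from the harness) shows that assembly:

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's --gen_kwargs assembly: skip the flag
    entirely when no sampling parameter is set."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this parameter max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    if not parts:
        return ""
    return '--gen_kwargs="' + ",".join(parts) + '"'

# With this task's defaults (temperature 1e-07, top_p 0.9999999):
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999))
```

The near-zero temperature and near-one top_p defaults make generation effectively greedy while still exercising the sampling code path.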

global_mmlu_full_am#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AM subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_am

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_am
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_am
target:
  api_endpoint:
    stream: false

global_mmlu_full_ar#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AR subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ar

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ar
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ar
target:
  api_endpoint:
    stream: false

global_mmlu_full_bn#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the BN subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_bn

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_bn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_bn
target:
  api_endpoint:
    stream: false

global_mmlu_full_cs#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the CS subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_cs

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_cs
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_cs
target:
  api_endpoint:
    stream: false

global_mmlu_full_de#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the DE subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_de

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_de
target:
  api_endpoint:
    stream: false

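As a concrete illustration, the Jinja command template above roughly assembles an invocation like the one sketched below for a "completions" endpoint. This is a simplified hand-rendering, not the template engine's output; the endpoint URL and model id are placeholders, not values from this page.

```python
# Sketch: how the config params above map onto an lm-eval CLI call.
params = {
    "task": "global_mmlu_full_de",
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
}
endpoint = {  # hypothetical target deployment
    "url": "http://localhost:8000/v1/completions",
    "model_id": "my-model",
    "stream": False,
}

# Comma-separated --model_args string, mirroring the template.
model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    "tokenized_requests=False",
    "tokenizer_backend=None",
    f"num_concurrent={params['parallelism']}",
    f"timeout={params['request_timeout']}",
    f"max_retries={params['max_retries']}",
    f"stream={endpoint['stream']}",
])
cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f'--model_args "{model_args}" '
    f"--gen_kwargs=\"temperature={params['temperature']},top_p={params['top_p']}\""
)
print(cmd)
```

For a "chat" endpoint the template instead selects `local-chat-completions` and appends `--fewshot_as_multiturn --apply_chat_template`.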
global_mmlu_full_el#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EL (Greek) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_el

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_el
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_el
target:
  api_endpoint:
    stream: false

global_mmlu_full_en#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EN (English) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_en

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_en
target:
  api_endpoint:
    stream: false

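A note on the sampling parameters repeated throughout these configs: `temperature: 1.0e-07` with `top_p: 0.9999999` makes decoding effectively greedy while steering clear of the exact-zero and exact-one boundary values that some API backends reject. A minimal sketch of why a near-zero temperature collapses the distribution onto the highest-logit token:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.9, 0.5]
probs = softmax_with_temperature(logits, 1.0e-07)
# At temperature ~0 the highest-logit token absorbs essentially
# all probability mass, i.e. sampling reduces to argmax.
```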
global_mmlu_full_es#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ES (Spanish) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_es

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_es
target:
  api_endpoint:
    stream: false

global_mmlu_full_fa#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FA (Persian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_fa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_fa
target:
  api_endpoint:
    stream: false

global_mmlu_full_fil#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FIL (Filipino) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_fil

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fil
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_fil
target:
  api_endpoint:
    stream: false

global_mmlu_full_fr#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FR (French) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_fr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_fr
target:
  api_endpoint:
    stream: false

global_mmlu_full_ha#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HA (Hausa) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ha

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ha
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ha
target:
  api_endpoint:
    stream: false

global_mmlu_full_he#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HE (Hebrew) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_he

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_he
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_he
target:
  api_endpoint:
    stream: false

global_mmlu_full_hi#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HI (Hindi) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_hi

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_hi
target:
  api_endpoint:
    stream: false

global_mmlu_full_id#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ID (Indonesian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_id

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_id
target:
  api_endpoint:
    stream: false

global_mmlu_full_ig#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the IG (Igbo) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ig

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ig
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ig
target:
  api_endpoint:
    stream: false

global_mmlu_full_it#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the IT (Italian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_it

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_it
target:
  api_endpoint:
    stream: false
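For reference, the Jinja command template above renders to a plain lm-eval invocation once the target and config values are filled in. The following is an illustrative rendering for global_mmlu_full_it against a completions endpoint; the base URL, model name, and output directory are placeholders, not defaults:

```shell
# Illustrative rendering of the command template for global_mmlu_full_it.
# base_url, model=my-model, and /results are placeholder values.
lm-eval --tasks global_mmlu_full_it \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples \
  --output_path /results \
  --use_cache /results/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

The chat-only flags (`--fewshot_as_multiturn --apply_chat_template`) are omitted here because this task supports only the completions endpoint type.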

global_mmlu_full_ja#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the JA (Japanese) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ja

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ja
target:
  api_endpoint:
    stream: false

global_mmlu_full_ko#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KO (Korean) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ko

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ko
target:
  api_endpoint:
    stream: false
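The defaults `temperature: 1.0e-07` and `top_p: 0.9999999` in these configs effectively request greedy decoding through the sampling API: dividing logits by a near-zero temperature concentrates essentially all probability mass on the argmax token. A minimal sketch of that effect:

```python
import math

def softmax(logits, temperature):
    # Scale logits by 1/temperature, then normalize.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.9, 0.5]

# At temperature 1.0 the distribution stays spread across close logits.
p_normal = softmax(logits, 1.0)

# At the configs' temperature of 1e-07 virtually all mass lands on the
# argmax, so sampling behaves like greedy decoding.
p_greedy = softmax(logits, 1e-07)

print(p_normal[0])  # well below 1
print(p_greedy[0])  # effectively 1.0
```

Likewise, `top_p: 0.9999999` keeps nucleus sampling enabled without actually truncating the distribution.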

global_mmlu_full_ky#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KY (Kyrgyz) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ky

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ky
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ky
target:
  api_endpoint:
    stream: false

global_mmlu_full_lt#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the LT (Lithuanian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_lt

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_lt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_lt
target:
  api_endpoint:
    stream: false

global_mmlu_full_mg#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MG (Malagasy) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_mg

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_mg
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_mg
target:
  api_endpoint:
    stream: false

global_mmlu_full_ms#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MS (Malay) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ms

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ms
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ms
target:
  api_endpoint:
    stream: false

global_mmlu_full_ne#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NE (Nepali) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ne

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ne
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ne
target:
  api_endpoint:
    stream: false

global_mmlu_full_nl#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NL (Dutch) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_nl

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_nl
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_nl
target:
  api_endpoint:
    stream: false

global_mmlu_full_ny#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NY (Nyanja) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ny

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ny
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ny
target:
  api_endpoint:
    stream: false

global_mmlu_full_pl#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PL (Polish) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_pl

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_pl
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_pl
target:
  api_endpoint:
    stream: false

global_mmlu_full_pt#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PT (Portuguese) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_pt

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_pt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_pt
target:
  api_endpoint:
    stream: false

global_mmlu_full_ro#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the RO (Romanian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ro

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ro
target:
  api_endpoint:
    stream: false

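For a concrete sense of what the Jinja template above produces, here is an illustrative rendering for the `global_mmlu_full_ro` task against a "completions" endpoint, using the config values shown (the API key variable, base URL, model id, and output directory are placeholders; the actual values come from the target definition at run time):

```shell
OPENAI_API_KEY=$MY_API_KEY lm-eval --tasks global_mmlu_full_ro \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples \
  --output_path /results \
  --use_cache /results/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

Note that `temperature=1e-07` together with `top_p=0.9999999` makes sampling effectively greedy while keeping the generation kwargs well-defined for the endpoint.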
global_mmlu_full_ru#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Russian (ru) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ru

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ru
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ru
target:
  api_endpoint:
    stream: false

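Each task runs inside the container listed above. Because the tag may be re-pushed, the digest is the stable identifier; the image can be pulled pinned to that exact build:

```shell
# Pull by digest to pin the exact build listed above
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness@sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```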
global_mmlu_full_si#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Sinhala (si) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_si

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_si
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_si
target:
  api_endpoint:
    stream: false

global_mmlu_full_sn#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Shona (sn) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sn

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sn
target:
  api_endpoint:
    stream: false

global_mmlu_full_so#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Somali (so) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_so

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_so
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_so
target:
  api_endpoint:
    stream: false

global_mmlu_full_sr#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Serbian (sr) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sr
target:
  api_endpoint:
    stream: false

global_mmlu_full_sv#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Swedish (sv) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sv

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sv
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sv
target:
  api_endpoint:
    stream: false

global_mmlu_full_sw#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Swahili (sw) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sw

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sw
target:
  api_endpoint:
    stream: false

global_mmlu_full_te#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Telugu (te) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_te

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_te
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_te
target:
  api_endpoint:
    stream: false

global_mmlu_full_tr#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Turkish (tr) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_tr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_tr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_tr
target:
  api_endpoint:
    stream: false

global_mmlu_full_uk#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Ukrainian (uk) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_uk

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_uk
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_uk
target:
  api_endpoint:
    stream: false

global_mmlu_full_vi#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Vietnamese (vi) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_vi

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_vi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_vi
target:
  api_endpoint:
    stream: false

global_mmlu_full_yo#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Yoruba (yo) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_yo

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_yo
target:
  api_endpoint:
    stream: false
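
The command template above is Jinja2; the flags it emits come straight from the YAML config. The following sketch shows how the `--model_args` string and the final `lm-eval` invocation are assembled from those values (the endpoint URL and model name are hypothetical placeholders, not part of the config):

```python
# Sketch: assembling the lm-eval command for global_mmlu_full_yo from the
# config values shown above. The endpoint URL and model name are placeholders.
params = {
    "task": "global_mmlu_full_yo",
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # placeholder
    "model_id": "my-model",                         # placeholder
    "stream": False,
}

# Mirrors the model_args portion of the Jinja template (tokenizer omitted
# because it is null in the config).
model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    "tokenized_requests=False",
    "tokenizer_backend=None",
    f"num_concurrent={params['parallelism']}",
    f"timeout={params['request_timeout']}",
    f"max_retries={params['max_retries']}",
    f"stream={endpoint['stream']}",
])

cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f'--model_args "{model_args}" --log_samples '
    f"--gen_kwargs=\"temperature={params['temperature']},top_p={params['top_p']}\""
)
print(cmd)
```

This task only supports `completions` endpoints, so the template always selects the `local-completions` backend here.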

global_mmlu_full_zh#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ZH subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_zh

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_zh
target:
  api_endpoint:
    stream: false
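
One branch of the template picks the lm-eval backend from the endpoint type: `completions` endpoints use `local-completions`, while `chat` endpoints use `local-chat-completions` (and additionally get `--fewshot_as_multiturn --apply_chat_template`). A minimal sketch of that selection logic:

```python
# Sketch of the endpoint-type branch in the command template. For this task
# only "completions" is listed under supported_endpoint_types.
def pick_backend(endpoint_type: str) -> str:
    """Return the lm-eval model backend for a given API endpoint type."""
    if endpoint_type == "completions":
        return "local-completions"
    if endpoint_type == "chat":
        return "local-chat-completions"
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(pick_backend("completions"))  # local-completions
```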

global_mmlu_hi#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the HI subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_hi

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_hi
target:
  api_endpoint:
    stream: false

global_mmlu_id#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ID subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_id

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_id
target:
  api_endpoint:
    stream: false

global_mmlu_it#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the IT subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_it

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_it
target:
  api_endpoint:
    stream: false

global_mmlu_ja#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the JA subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_ja

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ja
target:
  api_endpoint:
    stream: false

global_mmlu_ko#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the KO subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_ko

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ko
target:
  api_endpoint:
    stream: false

global_mmlu_pt#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the PT subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_pt

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_pt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_pt
target:
  api_endpoint:
    stream: false

global_mmlu_sw#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the SW subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_sw

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_sw
target:
  api_endpoint:
    stream: false

global_mmlu_yo#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the YO subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_yo

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_yo
target:
  api_endpoint:
    stream: false

global_mmlu_zh#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ZH subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_zh

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_zh
target:
  api_endpoint:
    stream: false

gpqa#

The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gpqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_gpqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: gpqa
target:
  api_endpoint:
    stream: false

gpqa_diamond_cot#

  • The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gpqa_diamond_cot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond_cot_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_cot
target:
  api_endpoint:
    stream: false
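
The `--gen_kwargs` portion of the command template above assembles generation parameters only when they are set, and renames `max_new_tokens` to lm-eval's `max_gen_toks`. The helper below is an illustrative sketch of that mapping (the function name is ours, not part of the harness); it joins the parts cleanly with commas, whereas the raw template emits them positionally.

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Sketch of the template's --gen_kwargs assembly: each parameter is
    included only when set; max_new_tokens maps to lm-eval's max_gen_toks."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    return ",".join(parts)

# Values from the gpqa_diamond_cot config above:
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999, max_new_tokens=1024))
# temperature=1e-07,top_p=0.9999999,max_gen_toks=1024
```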

gsm8k#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: gsm8k
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: gsm8k
target:
  api_endpoint:
    stream: false
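
The template branches on the endpoint type: `completions` targets use lm-eval's `local-completions` backend, while `chat` targets use `local-chat-completions` and additionally get `--fewshot_as_multiturn --apply_chat_template`. A minimal sketch of that branching (helper name is ours; gsm8k itself supports only the completions endpoint):

```python
def endpoint_flags(endpoint_type):
    """Mirror the template's endpoint branching: backend selection plus the
    chat-only few-shot/template flags."""
    if endpoint_type == "completions":
        return "--model local-completions"
    if endpoint_type == "chat":
        return ("--model local-chat-completions "
                "--fewshot_as_multiturn --apply_chat_template")
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```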

gsm8k_cot_instruct#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: gsm8k_zeroshot_cot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --add_instruction
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_instruct
target:
  api_endpoint:
    stream: false

gsm8k_cot_llama#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought evaluation, with the implementation taken from Llama.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_llama

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gsm8k_cot_llama
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_llama
target:
  api_endpoint:
    stream: false

gsm8k_cot_zeroshot#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_zeroshot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gsm8k_cot_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_zeroshot
target:
  api_endpoint:
    stream: false

gsm8k_cot_zeroshot_llama#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation, with the implementation taken from Llama.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_zeroshot_llama

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gsm8k_cot_llama
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_zeroshot_llama
target:
  api_endpoint:
    stream: false

hellaswag#

The HellaSwag benchmark tests a language model’s commonsense reasoning by having it choose the most logical ending for a given story.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: hellaswag

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: hellaswag
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 10
  supported_endpoint_types:
  - completions
  type: hellaswag
target:
  api_endpoint:
    stream: false

hellaswag_multilingual#

The multilingual versions of the HellaSwag benchmark.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: hellaswag_multilingual

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: hellaswag_multilingual
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 10
  supported_endpoint_types:
  - completions
  type: hellaswag_multilingual
target:
  api_endpoint:
    stream: false
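
The `--model_args` string in each command above is assembled field by field from the target and config values; `tokenizer` is included only when set. The sketch below mirrors that assembly in simplified form (function name, URL, and model ID are hypothetical examples; the raw template joins fields positionally and can emit an empty slot when `tokenizer` is null).

```python
def build_model_args(url, model_id, parallelism, request_timeout, max_retries,
                     stream=False, tokenized_requests=False,
                     tokenizer=None, tokenizer_backend="None"):
    """Sketch of the template's --model_args assembly: tokenizer appears
    only when configured; the remaining fields are always emitted."""
    parts = [
        f"base_url={url}",
        f"model={model_id}",
        f"tokenized_requests={tokenized_requests}",
    ]
    if tokenizer is not None:
        parts.append(f"tokenizer={tokenizer}")
    parts += [
        f"tokenizer_backend={tokenizer_backend}",
        f"num_concurrent={parallelism}",
        f"timeout={request_timeout}",
        f"max_retries={max_retries}",
        f"stream={stream}",
    ]
    return ",".join(parts)

# Hypothetical endpoint, with the defaults from the configs on this page:
print(build_model_args("http://localhost:8000/v1", "my-model",
                       parallelism=10, request_timeout=30, max_retries=5))
```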

humaneval_instruct#

The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. The implementation is taken from Llama.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: humaneval_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: humaneval_instruct
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: humaneval_instruct
target:
  api_endpoint:
    stream: false

ifeval#

IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: ifeval

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: ifeval
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: ifeval
target:
  api_endpoint:
    stream: false

m_mmlu_id_str_chat#

The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: m_mmlu_id_str_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: m_mmlu_id_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code
  supported_endpoint_types:
  - chat
  type: m_mmlu_id_str_chat
target:
  api_endpoint:
    stream: false

m_mmlu_id_str_completions#

The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: m_mmlu_id_str_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: m_mmlu_id_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: m_mmlu_id_str_completions
target:
  api_endpoint:
    stream: false

mbpp_plus_chat#

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mbpp_plus_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mbpp_plus
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --confirm_run_unsafe_code
  supported_endpoint_types:
  - chat
  type: mbpp_plus_chat
target:
  api_endpoint:
    stream: false
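The configs on this page request `temperature: 1.0e-07` and `top_p: 0.9999999` rather than literal zero and one; a tiny positive temperature keeps the request valid for endpoints that reject `temperature=0` while still collapsing sampling onto the most likely token. A minimal sketch of why, using hypothetical logits (not part of the harness itself):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# At temperature 1e-07 the distribution collapses onto the argmax token,
# so a "sampling" request behaves like greedy decoding.
near_greedy = softmax_with_temperature([2.0, 1.9, 0.5], 1.0e-07)
print(near_greedy[0])  # effectively 1.0; all other tokens get ~0 probability
```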

mbpp_plus_completions#

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mbpp_plus_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mbpp_plus
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --confirm_run_unsafe_code
  supported_endpoint_types:
  - completions
  type: mbpp_plus_completions
target:
  api_endpoint:
    stream: false

mgsm#

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mgsm

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mgsm_direct
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mgsm
target:
  api_endpoint:
    stream: false

mgsm_cot_chat#

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the chat endpoint and defaults to chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mgsm_cot_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mgsm_cot_native
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: mgsm_cot_chat
target:
  api_endpoint:
    stream: false
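Chain-of-thought variants like this one generate free-form reasoning, so scoring depends on extracting a final answer from the completion. A hedged sketch of the common "take the last number" heuristic follows; the harness's real extraction filters are task-specific and may differ:

```python
import re

def extract_final_answer(completion):
    """Return the last number in a chain-of-thought completion, a common
    extraction heuristic for GSM8K/MGSM-style math tasks (sketch only)."""
    # Strip thousands separators so "1,000" is read as one number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

cot = "Roger starts with 5 balls. 2 cans of 3 is 6. 5 + 6 = 11. The answer is 11."
print(extract_final_answer(cot))  # "11"
```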

mgsm_cot_completions#

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the completions endpoint and defaults to chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mgsm_cot_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mgsm_cot_native
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - completions
  type: mgsm_cot_completions
target:
  api_endpoint:
    stream: false

mmlu#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses text generation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: mmlu
target:
  api_endpoint:
    stream: false

mmlu_cot_0_shot_chat#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant defaults to zero-shot chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_cot_0_shot_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_cot_0_shot_chat
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - chat
  type: mmlu_cot_0_shot_chat
target:
  api_endpoint:
    stream: false

mmlu_instruct#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the chat endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code --add_instruction
  supported_endpoint_types:
  - chat
  type: mmlu_instruct
target:
  api_endpoint:
    stream: false

mmlu_instruct_completions#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the completions endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_instruct_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code --add_instruction
  supported_endpoint_types:
  - completions
  type: mmlu_instruct_completions
target:
  api_endpoint:
    stream: false

mmlu_logits#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant scores answer choices using the model's logits rather than generated text.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_logits

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: mmlu_logits
target:
  api_endpoint:
    stream: false
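Unlike the generative `mmlu_str` variants, the logits-based task scores each multiple-choice question by comparing the log-likelihood the model assigns to each answer continuation. A minimal sketch of that selection rule, with hypothetical summed log-probabilities standing in for what the endpoint would return:

```python
def pick_choice(choice_logprobs):
    """Log-likelihood multiple-choice scoring (sketch): the prediction is the
    choice whose answer continuation received the highest total log-prob."""
    # choice_logprobs maps a choice letter to the summed token log-probs
    # for that continuation; these numbers are illustrative only.
    return max(choice_logprobs, key=choice_logprobs.get)

scores = {"A": -4.2, "B": -1.3, "C": -6.8, "D": -5.0}
print(pick_choice(scores))  # "B": highest (least negative) log-likelihood
```

Accuracy is then the fraction of questions where the selected letter matches the gold answer, so no text generation or answer parsing is involved.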

mmlu_pro#

MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_pro

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: mmlu_pro
target:
  api_endpoint:
    stream: false

mmlu_pro_instruct#

  • MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_pro_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: mmlu_pro_instruct
target:
  api_endpoint:
    stream: false

mmlu_prox_chat#

MMLU-ProX is a multilingual benchmark for advanced large language model evaluation (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_completions
target:
  api_endpoint:
    stream: false
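
The `_chat` and `_completions` variants of each task differ only in the endpoint type, which the template maps to an lm-eval model backend plus, for chat, two extra flags. A hedged sketch of that branch (the helper name `endpoint_flags` is illustrative):

```python
def endpoint_flags(endpoint_type):
    """Mirror the template's endpoint handling: pick the lm-eval backend
    and the chat-only flags for the given api_endpoint type."""
    if endpoint_type == "completions":
        return "local-completions", []
    if endpoint_type == "chat":
        # Chat endpoints also get few-shot examples as multi-turn messages
        # and the model's chat template applied.
        return "local-chat-completions", ["--fewshot_as_multiturn",
                                          "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```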

mmlu_prox_de_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - German dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_de_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_de_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_de_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - German dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_de_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_de_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_es_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Spanish dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_es_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_es_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_es_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Spanish dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_es_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_es_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_fr_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - French dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_fr_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_fr_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_fr_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - French dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_fr_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_fr_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_it_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Italian dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_it_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_it_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_it_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Italian dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_it_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_it_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_ja_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Japanese dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_ja_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_ja_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_ja_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Japanese dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_ja_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_ja_completions
target:
  api_endpoint:
    stream: false
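In the command template above, the `--gen_kwargs` flag appears only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set. A minimal Python sketch of that assembly (the function name is illustrative, and the parts are comma-joined rather than reproducing the template's literal comma placement):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mimic the template's --gen_kwargs assembly: the flag is emitted
    only when at least one sampling parameter is set."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval calls the generation-length cap max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

# With this task's defaults (temperature 1.0e-07, top_p 0.9999999):
print(build_gen_kwargs(temperature=1e-07, top_p=0.9999999))
# → --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```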

mmlu_redux#

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_redux

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_redux
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_redux
target:
  api_endpoint:
    stream: false

mmlu_redux_instruct#

  • MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
  • This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_redux_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 8192
    max_retries: 5
    parallelism: 10
    task: mmlu_redux
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --add_instruction
  supported_endpoint_types:
  - chat
  type: mmlu_redux_instruct
target:
  api_endpoint:
    stream: false
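This instruct variant is restricted to chat endpoints. In the command template, the endpoint type selects both the lm-eval model backend and the chat-only flags; a hedged sketch of that branching (function name is illustrative):

```python
def endpoint_flags(endpoint_type):
    """Mirror the template's branching on target.api_endpoint.type:
    chat endpoints use the chat backend and additionally get
    --fewshot_as_multiturn and --apply_chat_template."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(endpoint_flags("chat"))
```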

musr#

The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: musr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_musr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: musr
target:
  api_endpoint:
    stream: false

openbookqa#

  • OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding of a subject.
  • Answering OpenBookQA questions requires additional broad common knowledge not contained in the book.
  • The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: openbookqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: openbookqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: openbookqa
target:
  api_endpoint:
    stream: false

piqa#

  • Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: piqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: piqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: piqa
target:
  api_endpoint:
    stream: false

social_iqa#

  • Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: social_iqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: social_iqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: social_iqa
target:
  api_endpoint:
    stream: false

truthfulqa#

  • The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions.
  • It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: truthfulqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: truthfulqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: truthfulqa
target:
  api_endpoint:
    stream: false

wikilingua#

  • The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: wikilingua

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: wikilingua
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - chat
  type: wikilingua
target:
  api_endpoint:
    stream: false

wikitext#

  • The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
  • This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: wikitext

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: wikitext
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: wikitext
target:
  api_endpoint:
    stream: false

winogrande#

WinoGrande is a collection of 44k problems formulated as a fill-in-the-blank task with binary options, testing commonsense reasoning.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: winogrande

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: winogrande
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: winogrande
target:
  api_endpoint:
    stream: false
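Note that winogrande pins `num_fewshot: 5` under `extra`. In the command template, `--num_fewshot` is added only when the key is defined (Jinja's `is defined` test), so an explicit `0`, as in mmlu_redux_instruct, is still passed through while tasks that omit the key fall back to the harness default. A small illustrative sketch:

```python
def fewshot_flag(extra):
    """Emit --num_fewshot only when the key exists in config.params.extra,
    mirroring Jinja's 'is defined' test: an explicit 0 is still emitted."""
    if "num_fewshot" in extra:
        return ["--num_fewshot", str(extra["num_fewshot"])]
    return []

print(fewshot_flag({"num_fewshot": 5}))   # winogrande
print(fewshot_flag({}))                   # tasks using the harness default
```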