lm-evaluation-harness

This page lists all evaluation tasks available in the lm-evaluation-harness.

Task

Description

adlr_agieval_en_cot

Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_arc_challenge_llama_25_shot

ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_commonsense_qa_7_shot

CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_global_mmlu_lite_5_shot

Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_gpqa_diamond_cot_5_shot

Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_gsm8k_cot_8_shot

GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_humaneval_greedy

HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_humaneval_sampled

HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_math_500_4_shot_sampled

MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mbpp_sanitized_3_shot_greedy

MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mbpp_sanitized_3_shot_sampled

MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mgsm_native_cot_8_shot

MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_minerva_math_nemo_4_shot

Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mmlu

MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_mmlu_pro_5_shot_base

MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_race

RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_truthfulqa_mc2

TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).

adlr_winogrande_5_shot

Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).

agieval

AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models

arc_challenge

The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.

arc_challenge_chat

  • The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. - This variant applies a chat template and defaults to zero-shot evaluation.

arc_multilingual

The multilingual versions of the ARC challenge dataset.

bbh

The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.

bbh_instruct

  • The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. - This variant applies a chat template and defaults to zero-shot evaluation.

bbq_chat

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).

bbq_completions

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint).

commonsense_qa

  • CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. - It contains 12,102 questions with one correct answer and four distractor answers.

global_mmlu

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - It is designed for efficient evaluation of multilingual models in 15 languages (including English).

global_mmlu_ar

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the AR subset.

global_mmlu_bn

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the BN subset.

global_mmlu_de

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the DE subset.

global_mmlu_en

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the EN subset.

global_mmlu_es

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ES subset.

global_mmlu_fr

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the FR subset.

global_mmlu_full

Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.

global_mmlu_full_am

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AM subset.

global_mmlu_full_ar

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AR subset.

global_mmlu_full_bn

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the BN subset.

global_mmlu_full_cs

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the CS subset.

global_mmlu_full_de

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the DE subset.

global_mmlu_full_el

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EL subset.

global_mmlu_full_en

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EN subset.

global_mmlu_full_es

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ES subset.

global_mmlu_full_fa

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FA subset.

global_mmlu_full_fil

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FIL subset.

global_mmlu_full_fr

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FR subset.

global_mmlu_full_ha

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HA subset.

global_mmlu_full_he

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HE subset.

global_mmlu_full_hi

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HI subset.

global_mmlu_full_id

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ID subset.

global_mmlu_full_ig

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IG subset.

global_mmlu_full_it

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IT subset.

global_mmlu_full_ja

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the JA subset.

global_mmlu_full_ko

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KO subset.

global_mmlu_full_ky

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KY subset.

global_mmlu_full_lt

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the LT subset.

global_mmlu_full_mg

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MG subset.

global_mmlu_full_ms

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MS subset.

global_mmlu_full_ne

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NE subset.

global_mmlu_full_nl

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NL subset.

global_mmlu_full_ny

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NY subset.

global_mmlu_full_pl

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PL subset.

global_mmlu_full_pt

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PT subset.

global_mmlu_full_ro

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RO subset.

global_mmlu_full_ru

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RU subset.

global_mmlu_full_si

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SI subset.

global_mmlu_full_sn

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SN subset.

global_mmlu_full_so

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SO subset.

global_mmlu_full_sr

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SR subset.

global_mmlu_full_sv

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SV subset.

global_mmlu_full_sw

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SW subset.

global_mmlu_full_te

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TE subset.

global_mmlu_full_tr

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TR subset.

global_mmlu_full_uk

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the UK subset.

global_mmlu_full_vi

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the VI subset.

global_mmlu_full_yo

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the YO subset.

global_mmlu_full_zh

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ZH subset.

global_mmlu_hi

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the HI subset.

global_mmlu_id

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ID subset.

global_mmlu_it

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the IT subset.

global_mmlu_ja

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the JA subset.

global_mmlu_ko

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the KO subset.

global_mmlu_pt

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the PT subset.

global_mmlu_sw

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the SW subset.

global_mmlu_yo

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the YO subset.

global_mmlu_zh

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ZH subset.

gpqa

The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.

gpqa_diamond_cot

  • The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. - This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.

gsm8k

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.

gsm8k_cot_instruct

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.

gsm8k_cot_llama

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought evaluation - implementation taken from Llama.

gsm8k_cot_zeroshot

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation.

gsm8k_cot_zeroshot_llama

  • The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation - implementation taken from Llama.

hellaswag

The HellaSwag benchmark tests a language model’s commonsense reasoning by having it choose the most logical ending for a given story.

hellaswag_multilingual

The multilingual versions of the HellaSwag benchmark.

humaneval_instruct

  • The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. - Implementation taken from Llama.

ifeval

IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.

m_mmlu_id_str_chat

  • The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint).

m_mmlu_id_str_completions

  • The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint).

mbpp_plus_chat

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).

mbpp_plus_completions

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).

mgsm

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.

mgsm_cot_chat

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the chat endpoint and defaults to chain-of-thought evaluation.

mgsm_cot_completions

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the completions endpoint and defaults to chain-of-thought evaluation.

mmlu

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses text generation.

mmlu_cot_0_shot_chat

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant defaults to chain-of-thought zero-shot evaluation.

mmlu_instruct

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the chat endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response.

mmlu_instruct_completions

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the completions endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response.

mmlu_logits

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the logits of the model to evaluate the accuracy.

mmlu_pro

MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint).

mmlu_pro_instruct

  • MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. - This variant applies a chat template and defaults to zero-shot evaluation.

mmlu_prox_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint)

mmlu_prox_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint)

mmlu_prox_de_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (chat endpoint)

mmlu_prox_de_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (completions endpoint)

mmlu_prox_es_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (chat endpoint)

mmlu_prox_es_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (completions endpoint)

mmlu_prox_fr_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (chat endpoint)

mmlu_prox_fr_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (completions endpoint)

mmlu_prox_it_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (chat endpoint)

mmlu_prox_it_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (completions endpoint)

mmlu_prox_ja_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (chat endpoint)

mmlu_prox_ja_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (completions endpoint)

mmlu_redux

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.

mmlu_redux_instruct

  • MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. - This variant applies a chat template and defaults to zero-shot evaluation.

musr

The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.

openbookqa

  • OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject. - Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book. - The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

piqa

  • Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.

social_iqa

  • Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.

truthfulqa

  • The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions. - It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.

wikilingua

  • The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.

wikitext

  • The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. - This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods.

winogrande

WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options, testing commonsense reasoning.

adlr_agieval_en_cot

Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_agieval_en_cot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_agieval_en_cot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_agieval_en_cot
target:
  api_endpoint:
    stream: false
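In the command template above, the `--gen_kwargs` flag is only emitted when at least one sampling parameter (`temperature`, `top_p`, or `max_new_tokens`) is set, and `max_new_tokens` is mapped to lm-eval's `max_gen_toks` key. A simplified Python sketch of that conditional assembly (the `build_gen_kwargs` helper name is ours, not part of the harness, and unlike the raw template it never emits a leading comma):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Assemble the --gen_kwargs flag, omitting it when no parameter is set,
    as the Jinja template's outer 'is not none' check does."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval calls this generation limit 'max_gen_toks'
        parts.append(f"max_gen_toks={max_new_tokens}")
    if not parts:
        return ""  # flag omitted entirely
    return '--gen_kwargs="' + ",".join(parts) + '"'

# With this task's defaults (temperature 0.0, top_p 1.0e-05):
print(build_gen_kwargs(temperature=0.0, top_p=1e-05))
# → --gen_kwargs="temperature=0.0,top_p=1e-05"
```

With all three parameters unset, the helper returns an empty string and the flag is dropped from the rendered command, matching the template's behavior.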

adlr_arc_challenge_llama_25_shot

ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_arc_challenge_llama_25_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_arc_challenge_llama
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 25
  supported_endpoint_types:
  - completions
  type: adlr_arc_challenge_llama_25_shot
target:
  api_endpoint:
    stream: false
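Note that the template guards `--num_fewshot` with a Jinja `is defined` test rather than `is not none`: the flag appears only when `config.params.extra` actually contains a `num_fewshot` key, as this task's config does (`num_fewshot: 25`). A minimal sketch of that behavior (the `fewshot_flag` helper is illustrative, not part of the harness):

```python
def fewshot_flag(extra: dict) -> str:
    """Emit --num_fewshot only when the key is present in the 'extra' config,
    mirroring the template's 'is defined' check."""
    if "num_fewshot" in extra:
        return f"--num_fewshot {extra['num_fewshot']}"
    return ""  # key absent: the task's own default shot count applies

print(fewshot_flag({"num_fewshot": 25}))  # → --num_fewshot 25
print(fewshot_flag({}))                   # → (empty; flag omitted)
```

Tasks whose config omits the key (such as adlr_agieval_en_cot above) therefore run with the shot count baked into the task definition itself.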

adlr_commonsense_qa_7_shot

CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_commonsense_qa_7_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: commonsense_qa
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 7
  supported_endpoint_types:
  - completions
  type: adlr_commonsense_qa_7_shot
target:
  api_endpoint:
    stream: false

adlr_global_mmlu_lite_5_shot

Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_global_mmlu_lite_5_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_global_mmlu
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_global_mmlu_lite_5_shot
target:
  api_endpoint:
    stream: false

adlr_gpqa_diamond_cot_5_shot#

Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_gpqa_diamond_cot_5_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_gpqa_diamond_cot_5_shot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_gpqa_diamond_cot_5_shot
target:
  api_endpoint:
    stream: false
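
The Jinja template above only emits a `--gen_kwargs` flag when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and maps `max_new_tokens` to lm-eval's `max_gen_toks` key. As a minimal sketch of that flag-building logic (the helper name is ours, not part of the harness):

```python
# Illustrative reproduction of the --gen_kwargs branch of the Jinja template.
# Mirrors the template literally, including its conditional concatenation.
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Return the --gen_kwargs flag string, or '' when no sampling params are set."""
    if temperature is None and top_p is None and max_new_tokens is None:
        return ""
    parts = ""
    if temperature is not None:
        parts += f"temperature={temperature}"
    if top_p is not None:
        parts += f",top_p={top_p}"
    if max_new_tokens is not None:
        parts += f",max_gen_toks={max_new_tokens}"
    return f'--gen_kwargs="{parts}"'

# With the adlr_gpqa_diamond_cot_5_shot defaults (temperature 0.0, top_p 1e-05):
print(build_gen_kwargs(temperature=0.0, top_p=1e-05))
```

Note that, exactly as in the template, a flag built with `top_p` but no `temperature` starts with a leading comma inside the quotes, since the `,` prefixes are attached to the later parameters.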

adlr_gsm8k_cot_8_shot#

GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_gsm8k_cot_8_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_gsm8k_fewshot_cot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 8
  supported_endpoint_types:
  - completions
  type: adlr_gsm8k_cot_8_shot
target:
  api_endpoint:
    stream: false

adlr_humaneval_greedy#

HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_humaneval_greedy

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_greedy
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_humaneval_greedy
target:
  api_endpoint:
    stream: false

adlr_humaneval_sampled#

HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_humaneval_sampled

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_humaneval_sampled
target:
  api_endpoint:
    stream: false

adlr_math_500_4_shot_sampled#

MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_math_500_4_shot_sampled

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_math_500_4_shot_sampled
    temperature: 0.7
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
  - completions
  type: adlr_math_500_4_shot_sampled
target:
  api_endpoint:
    stream: false

adlr_mbpp_sanitized_3_shot_greedy#

MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mbpp_sanitized_3_shot_greedy

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mbpp_sanitized_3_shot_greedy
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 3
  supported_endpoint_types:
  - completions
  type: adlr_mbpp_sanitized_3_shot_greedy
target:
  api_endpoint:
    stream: false

adlr_mbpp_sanitized_3_shot_sampled#

MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mbpp_sanitized_3_shot_sampled

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mbpp_sanitized_3shot_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 3
  supported_endpoint_types:
  - completions
  type: adlr_mbpp_sanitized_3_shot_sampled
target:
  api_endpoint:
    stream: false

adlr_mgsm_native_cot_8_shot#

MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mgsm_native_cot_8_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mgsm_native_cot_8_shot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 8
  supported_endpoint_types:
  - completions
  type: adlr_mgsm_native_cot_8_shot
target:
  api_endpoint:
    stream: false

adlr_minerva_math_nemo_4_shot#

Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_minerva_math_nemo_4_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_minerva_math_nemo
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
  - completions
  type: adlr_minerva_math_nemo_4_shot
target:
  api_endpoint:
    stream: false

adlr_mmlu#

MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mmlu

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: adlr_mmlu
target:
  api_endpoint:
    stream: false
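
Across all of these entries, the template selects the lm-eval model backend from the endpoint type: `completions` endpoints use `local-completions`, while `chat` endpoints use `local-chat-completions` and additionally get `--fewshot_as_multiturn --apply_chat_template`. A rough sketch of that dispatch (the function name is ours, for illustration only):

```python
# Illustrative sketch of the endpoint-type dispatch in the command template.
# Only these two endpoint types are handled by the template; the ADLR tasks
# above all declare supported_endpoint_types: [completions].
def endpoint_flags(endpoint_type):
    """Map an api_endpoint.type to the lm-eval flags the template emits."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(endpoint_flags("completions"))
```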

adlr_mmlu_pro_5_shot_base#

MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_mmlu_pro_5_shot_base

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mmlu_pro_5_shot_base
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_mmlu_pro_5_shot_base
target:
  api_endpoint:
    stream: false

adlr_race#

RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_race

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_race
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_race
target:
  api_endpoint:
    stream: false
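
In the command template above, the `--gen_kwargs` flag is assembled conditionally: it is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and lm-eval names the token limit `max_gen_toks`. A minimal Python sketch of that flag-assembly logic (the function name is illustrative, not part of the harness):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's conditional: emit --gen_kwargs only when at
    least one generation parameter is set; lm-eval calls the output
    token limit max_gen_toks."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    if not parts:
        return ""  # flag omitted entirely when nothing is configured
    return '--gen_kwargs="' + ",".join(parts) + '"'
```

With the adlr_race defaults (temperature 1.0, top_p 1.0, no max_new_tokens) this yields `--gen_kwargs="temperature=1.0,top_p=1.0"`.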

adlr_truthfulqa_mc2#

TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_truthfulqa_mc2

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_truthfulqa_mc2
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: adlr_truthfulqa_mc2
target:
  api_endpoint:
    stream: false

adlr_winogrande_5_shot#

Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: adlr_winogrande_5_shot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: winogrande
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: adlr_winogrande_5_shot
target:
  api_endpoint:
    stream: false
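
The `--model_args` value in the template is a comma-separated string built from the endpoint target and the `extra` parameters, with the `tokenizer=` entry included only when a tokenizer override is configured. A hedged sketch of that assembly, with defaults taken from the configs on this page (helper name illustrative):

```python
def build_model_args(base_url, model_id, *, tokenized_requests=False,
                     tokenizer=None, tokenizer_backend="None",
                     num_concurrent=10, timeout=30, max_retries=5,
                     stream=False):
    """Assemble the comma-separated --model_args value, skipping the
    tokenizer= entry when no tokenizer override is given."""
    parts = [
        f"base_url={base_url}",
        f"model={model_id}",
        f"tokenized_requests={tokenized_requests}",
    ]
    if tokenizer is not None:
        parts.append(f"tokenizer={tokenizer}")
    parts += [
        f"tokenizer_backend={tokenizer_backend}",
        f"num_concurrent={num_concurrent}",
        f"timeout={timeout}",
        f"max_retries={max_retries}",
        f"stream={stream}",
    ]
    return ",".join(parts)
```

For example, `build_model_args("http://localhost:8000/v1", "my-model")` (a hypothetical local endpoint) produces a string starting with `base_url=http://localhost:8000/v1,model=my-model` and containing no `tokenizer=` entry.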

agieval#

AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: agieval

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: agieval
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: agieval
target:
  api_endpoint:
    stream: false

arc_challenge#

The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: arc_challenge

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: arc_challenge
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: arc_challenge
target:
  api_endpoint:
    stream: false

arc_challenge_chat#

The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: arc_challenge_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: arc_challenge_chat
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: arc_challenge_chat
target:
  api_endpoint:
    stream: false
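
As the template shows, the endpoint type selects both the lm-eval model backend and the chat-only flags (`--fewshot_as_multiturn --apply_chat_template`). A minimal sketch of that branch (function name illustrative):

```python
def endpoint_flags(endpoint_type):
    """Map the target endpoint type to the lm-eval backend name and
    the extra CLI flags added only for chat endpoints."""
    if endpoint_type == "completions":
        return "local-completions", []
    if endpoint_type == "chat":
        return "local-chat-completions", [
            "--fewshot_as_multiturn",
            "--apply_chat_template",
        ]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```

This is why a chat-only task such as arc_challenge_chat lists only `chat` under `supported_endpoint_types`: its prompts assume the chat template is applied.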

arc_multilingual#

The multilingual versions of the ARC challenge dataset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: arc_multilingual

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: arc_multilingual
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: arc_multilingual
target:
  api_endpoint:
    stream: false

bbh#

The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbh

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_bbh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: bbh
target:
  api_endpoint:
    stream: false

bbh_instruct#

The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbh_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbh_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: bbh_instruct
target:
  api_endpoint:
    stream: false

bbq_chat#

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbq_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbq_generate
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: bbq_chat
target:
  api_endpoint:
    stream: false

bbq_completions#

The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: bbq_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: bbq_generate
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: bbq_completions
target:
  api_endpoint:
    stream: false

commonsense_qa#

CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: commonsense_qa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: commonsense_qa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 7
  supported_endpoint_types:
  - completions
  type: commonsense_qa
target:
  api_endpoint:
    stream: false

global_mmlu#

Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. It is designed for efficient evaluation of multilingual models in 15 languages (including English).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu
target:
  api_endpoint:
    stream: false
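
The command template above folds the endpoint settings into a single `--model_args` string for lm-eval. As a rough illustration, the sketch below assembles that string from the config values shown; the `base_url` and `model_id` values are placeholders, not part of this task definition:

```python
# Sketch: assemble the --model_args string the template produces for a
# "completions" endpoint. URL and model ID below are assumptions.
params = {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 30,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # placeholder
    "model_id": "my-model",                         # placeholder
    "stream": False,
}

model_args = (
    f"base_url={endpoint['url']},"
    f"model={endpoint['model_id']},"
    f"tokenized_requests=False,"
    f"tokenizer_backend=None,"
    f"num_concurrent={params['parallelism']},"
    f"timeout={params['request_timeout']},"
    f"max_retries={params['max_retries']},"
    f"stream={endpoint['stream']}"
)
print(model_args)
```

Note that `num_concurrent` comes from `config.params.parallelism` and `timeout` from `config.params.request_timeout`, so tuning either requires changing the config, not the rendered command.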

global_mmlu_ar#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the AR subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_ar

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ar
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ar
target:
  api_endpoint:
    stream: false

global_mmlu_bn#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the BN subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_bn

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_bn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_bn
target:
  api_endpoint:
    stream: false

global_mmlu_de#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the DE subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_de

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_de
target:
  api_endpoint:
    stream: false

global_mmlu_en#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the EN subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_en

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_en
target:
  api_endpoint:
    stream: false

global_mmlu_es#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ES subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_es

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_es
target:
  api_endpoint:
    stream: false

global_mmlu_fr#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the FR subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_fr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_fr
target:
  api_endpoint:
    stream: false

global_mmlu_full#

Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full
target:
  api_endpoint:
    stream: false
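
Per the template, `--gen_kwargs` is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and `max_new_tokens` is passed through as lm-eval's `max_gen_toks`. The following cleaned-up sketch (a simplification of the Jinja logic, not code from the harness) shows that assembly:

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's --gen_kwargs assembly: skip the flag
    entirely when no sampling parameter is set."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this parameter max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    if not parts:
        return ""
    return '--gen_kwargs="' + ",".join(parts) + '"'

# With this task's defaults (temperature 1e-07, top_p 0.9999999):
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999))
```

The near-zero temperature and near-one top_p defaults make generation effectively greedy while still exercising the sampling code path.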

global_mmlu_full_am#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AM subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_am

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_am
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_am
target:
  api_endpoint:
    stream: false

global_mmlu_full_ar#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AR subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ar

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ar
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ar
target:
  api_endpoint:
    stream: false

global_mmlu_full_bn#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the BN subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_bn

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_bn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_bn
target:
  api_endpoint:
    stream: false

global_mmlu_full_cs#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the CS subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_cs

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_cs
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_cs
target:
  api_endpoint:
    stream: false

global_mmlu_full_de#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the DE subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_de

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_de
target:
  api_endpoint:
    stream: false

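As a concrete illustration, the Jinja command template above roughly assembles an invocation like the one sketched below for a "completions" endpoint. This is a simplified hand-rendering, not the template engine's output; the endpoint URL and model id are placeholders, not values from this page.

```python
# Sketch: how the config params above map onto an lm-eval CLI call.
params = {
    "task": "global_mmlu_full_de",
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
}
endpoint = {  # hypothetical target deployment
    "url": "http://localhost:8000/v1/completions",
    "model_id": "my-model",
    "stream": False,
}

# Comma-separated --model_args string, mirroring the template.
model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    "tokenized_requests=False",
    "tokenizer_backend=None",
    f"num_concurrent={params['parallelism']}",
    f"timeout={params['request_timeout']}",
    f"max_retries={params['max_retries']}",
    f"stream={endpoint['stream']}",
])
cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f'--model_args "{model_args}" '
    f"--gen_kwargs=\"temperature={params['temperature']},top_p={params['top_p']}\""
)
print(cmd)
```

For a "chat" endpoint the template instead selects `local-chat-completions` and appends `--fewshot_as_multiturn --apply_chat_template`.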
global_mmlu_full_el#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EL (Greek) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_el

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_el
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_el
target:
  api_endpoint:
    stream: false

global_mmlu_full_en#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EN (English) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_en

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_en
target:
  api_endpoint:
    stream: false

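A note on the sampling parameters repeated throughout these configs: `temperature: 1.0e-07` with `top_p: 0.9999999` makes decoding effectively greedy while steering clear of the exact-zero and exact-one boundary values that some API backends reject. A minimal sketch of why a near-zero temperature collapses the distribution onto the highest-logit token:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.9, 0.5]
probs = softmax_with_temperature(logits, 1.0e-07)
# At temperature ~0 the highest-logit token absorbs essentially
# all probability mass, i.e. sampling reduces to argmax.
```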
global_mmlu_full_es#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ES (Spanish) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_es

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_es
target:
  api_endpoint:
    stream: false

global_mmlu_full_fa#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FA (Persian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_fa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_fa
target:
  api_endpoint:
    stream: false

global_mmlu_full_fil#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FIL (Filipino) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_fil

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fil
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_fil
target:
  api_endpoint:
    stream: false

global_mmlu_full_fr#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FR (French) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_fr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_fr
target:
  api_endpoint:
    stream: false

global_mmlu_full_ha#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HA (Hausa) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ha

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ha
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ha
target:
  api_endpoint:
    stream: false

global_mmlu_full_he#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HE (Hebrew) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_he

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_he
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_he
target:
  api_endpoint:
    stream: false

global_mmlu_full_hi#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HI (Hindi) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_hi

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_hi
target:
  api_endpoint:
    stream: false

global_mmlu_full_id#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ID (Indonesian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_id

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_id
target:
  api_endpoint:
    stream: false

global_mmlu_full_ig#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the IG (Igbo) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ig

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ig
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ig
target:
  api_endpoint:
    stream: false

global_mmlu_full_it#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the IT (Italian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_it

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_it
target:
  api_endpoint:
    stream: false
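For reference, the Jinja command template above renders to a plain lm-eval invocation once the target and config values are filled in. The following is an illustrative rendering for global_mmlu_full_it against a completions endpoint; the base URL, model name, and output directory are placeholders, not defaults:

```shell
# Illustrative rendering of the command template for global_mmlu_full_it.
# base_url, model=my-model, and /results are placeholder values.
lm-eval --tasks global_mmlu_full_it \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples \
  --output_path /results \
  --use_cache /results/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

The chat-only flags (`--fewshot_as_multiturn --apply_chat_template`) are omitted here because this task supports only the completions endpoint type.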

global_mmlu_full_ja#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the JA (Japanese) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ja

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ja
target:
  api_endpoint:
    stream: false

global_mmlu_full_ko#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KO (Korean) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ko

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ko
target:
  api_endpoint:
    stream: false
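The defaults `temperature: 1.0e-07` and `top_p: 0.9999999` in these configs effectively request greedy decoding through the sampling API: dividing logits by a near-zero temperature concentrates essentially all probability mass on the argmax token. A minimal sketch of that effect:

```python
import math

def softmax(logits, temperature):
    # Scale logits by 1/temperature, then normalize.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.9, 0.5]

# At temperature 1.0 the distribution stays spread across close logits.
p_normal = softmax(logits, 1.0)

# At the configs' temperature of 1e-07 virtually all mass lands on the
# argmax, so sampling behaves like greedy decoding.
p_greedy = softmax(logits, 1e-07)

print(p_normal[0])  # well below 1
print(p_greedy[0])  # effectively 1.0
```

Likewise, `top_p: 0.9999999` keeps nucleus sampling enabled without actually truncating the distribution.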

global_mmlu_full_ky#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KY (Kyrgyz) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ky

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ky
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ky
target:
  api_endpoint:
    stream: false

global_mmlu_full_lt#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the LT (Lithuanian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_lt

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_lt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_lt
target:
  api_endpoint:
    stream: false

global_mmlu_full_mg#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MG (Malagasy) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_mg

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_mg
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_mg
target:
  api_endpoint:
    stream: false

global_mmlu_full_ms#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MS (Malay) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ms

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ms
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ms
target:
  api_endpoint:
    stream: false

global_mmlu_full_ne#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NE (Nepali) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ne

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ne
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ne
target:
  api_endpoint:
    stream: false

global_mmlu_full_nl#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NL (Dutch) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_nl

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_nl
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_nl
target:
  api_endpoint:
    stream: false

global_mmlu_full_ny#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NY (Nyanja) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ny

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ny
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ny
target:
  api_endpoint:
    stream: false

global_mmlu_full_pl#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PL (Polish) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_pl

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_pl
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_pl
target:
  api_endpoint:
    stream: false

global_mmlu_full_pt#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PT (Portuguese) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_pt

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_pt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_pt
target:
  api_endpoint:
    stream: false

global_mmlu_full_ro#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the RO (Romanian) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ro

Command:

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ro
target:
  api_endpoint:
    stream: false

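For a concrete sense of what the Jinja template above produces, here is an illustrative rendering for the `global_mmlu_full_ro` task against a "completions" endpoint, using the config values shown (the API key variable, base URL, model id, and output directory are placeholders; the actual values come from the target definition at run time):

```shell
OPENAI_API_KEY=$MY_API_KEY lm-eval --tasks global_mmlu_full_ro \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples \
  --output_path /results \
  --use_cache /results/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

Note that `temperature=1e-07` together with `top_p=0.9999999` makes sampling effectively greedy while keeping the generation kwargs well-defined for the endpoint.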
global_mmlu_full_ru#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Russian (ru) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_ru

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ru
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_ru
target:
  api_endpoint:
    stream: false

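Each task runs inside the container listed above. Because the tag may be re-pushed, the digest is the stable identifier; the image can be pulled pinned to that exact build:

```shell
# Pull by digest to pin the exact build listed above
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness@sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```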
global_mmlu_full_si#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Sinhala (si) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_si

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_si
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_si
target:
  api_endpoint:
    stream: false

global_mmlu_full_sn#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Shona (sn) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sn

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sn
target:
  api_endpoint:
    stream: false

global_mmlu_full_so#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Somali (so) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_so

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_so
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_so
target:
  api_endpoint:
    stream: false

global_mmlu_full_sr#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Serbian (sr) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sr
target:
  api_endpoint:
    stream: false

global_mmlu_full_sv#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Swedish (sv) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sv

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sv
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sv
target:
  api_endpoint:
    stream: false

global_mmlu_full_sw#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Swahili (sw) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_sw

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_sw
target:
  api_endpoint:
    stream: false

global_mmlu_full_te#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Telugu (te) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_te

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_te
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_te
target:
  api_endpoint:
    stream: false

global_mmlu_full_tr#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Turkish (tr) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_tr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_tr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_tr
target:
  api_endpoint:
    stream: false

global_mmlu_full_uk#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Ukrainian (uk) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_uk

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_uk
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_uk
target:
  api_endpoint:
    stream: false

global_mmlu_full_vi#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Vietnamese (vi) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_vi

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_vi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_vi
target:
  api_endpoint:
    stream: false

global_mmlu_full_yo#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Yoruba (yo) subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_yo

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_yo
target:
  api_endpoint:
    stream: false
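
The command template above is Jinja2; the flags it emits come straight from the YAML config. The following sketch shows how the `--model_args` string and the final `lm-eval` invocation are assembled from those values (the endpoint URL and model name are hypothetical placeholders, not part of the config):

```python
# Sketch: assembling the lm-eval command for global_mmlu_full_yo from the
# config values shown above. The endpoint URL and model name are placeholders.
params = {
    "task": "global_mmlu_full_yo",
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # placeholder
    "model_id": "my-model",                         # placeholder
    "stream": False,
}

# Mirrors the model_args portion of the Jinja template (tokenizer omitted
# because it is null in the config).
model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    "tokenized_requests=False",
    "tokenizer_backend=None",
    f"num_concurrent={params['parallelism']}",
    f"timeout={params['request_timeout']}",
    f"max_retries={params['max_retries']}",
    f"stream={endpoint['stream']}",
])

cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f'--model_args "{model_args}" --log_samples '
    f"--gen_kwargs=\"temperature={params['temperature']},top_p={params['top_p']}\""
)
print(cmd)
```

This task only supports `completions` endpoints, so the template always selects the `local-completions` backend here.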

global_mmlu_full_zh#

  • Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ZH subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_full_zh

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_zh
target:
  api_endpoint:
    stream: false
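
One branch of the template picks the lm-eval backend from the endpoint type: `completions` endpoints use `local-completions`, while `chat` endpoints use `local-chat-completions` (and additionally get `--fewshot_as_multiturn --apply_chat_template`). A minimal sketch of that selection logic:

```python
# Sketch of the endpoint-type branch in the command template. For this task
# only "completions" is listed under supported_endpoint_types.
def pick_backend(endpoint_type: str) -> str:
    """Return the lm-eval model backend for a given API endpoint type."""
    if endpoint_type == "completions":
        return "local-completions"
    if endpoint_type == "chat":
        return "local-chat-completions"
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(pick_backend("completions"))  # local-completions
```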

global_mmlu_hi#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the HI subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_hi

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_hi
target:
  api_endpoint:
    stream: false

global_mmlu_id#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ID subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_id

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_id
target:
  api_endpoint:
    stream: false

global_mmlu_it#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the IT subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_it

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_it
target:
  api_endpoint:
    stream: false

global_mmlu_ja#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the JA subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_ja

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ja
target:
  api_endpoint:
    stream: false

global_mmlu_ko#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the KO subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_ko

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ko
target:
  api_endpoint:
    stream: false

global_mmlu_pt#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the PT subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_pt

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_pt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_pt
target:
  api_endpoint:
    stream: false

global_mmlu_sw#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the SW subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_sw

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_sw
target:
  api_endpoint:
    stream: false

global_mmlu_yo#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the YO subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_yo

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_yo
target:
  api_endpoint:
    stream: false

global_mmlu_zh#

  • Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ZH subset.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: global_mmlu_zh

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_zh
target:
  api_endpoint:
    stream: false

gpqa#

The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gpqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_gpqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: gpqa
target:
  api_endpoint:
    stream: false

gpqa_diamond_cot#

  • The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gpqa_diamond_cot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond_cot_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_cot
target:
  api_endpoint:
    stream: false
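
The `--gen_kwargs` portion of the command template above assembles generation parameters only when they are set, and renames `max_new_tokens` to lm-eval's `max_gen_toks`. The helper below is an illustrative sketch of that mapping (the function name is ours, not part of the harness); it joins the parts cleanly with commas, whereas the raw template emits them positionally.

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Sketch of the template's --gen_kwargs assembly: each parameter is
    included only when set; max_new_tokens maps to lm-eval's max_gen_toks."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    return ",".join(parts)

# Values from the gpqa_diamond_cot config above:
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999, max_new_tokens=1024))
# temperature=1e-07,top_p=0.9999999,max_gen_toks=1024
```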

gsm8k#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: gsm8k
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: gsm8k
target:
  api_endpoint:
    stream: false
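
The template branches on the endpoint type: `completions` targets use lm-eval's `local-completions` backend, while `chat` targets use `local-chat-completions` and additionally get `--fewshot_as_multiturn --apply_chat_template`. A minimal sketch of that branching (helper name is ours; gsm8k itself supports only the completions endpoint):

```python
def endpoint_flags(endpoint_type):
    """Mirror the template's endpoint branching: backend selection plus the
    chat-only few-shot/template flags."""
    if endpoint_type == "completions":
        return "--model local-completions"
    if endpoint_type == "chat":
        return ("--model local-chat-completions "
                "--fewshot_as_multiturn --apply_chat_template")
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```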

gsm8k_cot_instruct#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: gsm8k_zeroshot_cot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --add_instruction
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_instruct
target:
  api_endpoint:
    stream: false

gsm8k_cot_llama#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought evaluation, with the implementation taken from Llama.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_llama

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gsm8k_cot_llama
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_llama
target:
  api_endpoint:
    stream: false

gsm8k_cot_zeroshot#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_zeroshot

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gsm8k_cot_zeroshot
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_zeroshot
target:
  api_endpoint:
    stream: false

gsm8k_cot_zeroshot_llama#

The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation, with the implementation taken from Llama.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: gsm8k_cot_zeroshot_llama

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: gsm8k_cot_llama
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: gsm8k_cot_zeroshot_llama
target:
  api_endpoint:
    stream: false

hellaswag#

The HellaSwag benchmark tests a language model’s commonsense reasoning by having it choose the most logical ending for a given story.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: hellaswag

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: hellaswag
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 10
  supported_endpoint_types:
  - completions
  type: hellaswag
target:
  api_endpoint:
    stream: false

hellaswag_multilingual#

The multilingual versions of the HellaSwag benchmark.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: hellaswag_multilingual

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: hellaswag_multilingual
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 10
  supported_endpoint_types:
  - completions
  type: hellaswag_multilingual
target:
  api_endpoint:
    stream: false
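
The `--model_args` string in each command above is assembled field by field from the target and config values; `tokenizer` is included only when set. The sketch below mirrors that assembly in simplified form (function name, URL, and model ID are hypothetical examples; the raw template joins fields positionally and can emit an empty slot when `tokenizer` is null).

```python
def build_model_args(url, model_id, parallelism, request_timeout, max_retries,
                     stream=False, tokenized_requests=False,
                     tokenizer=None, tokenizer_backend="None"):
    """Sketch of the template's --model_args assembly: tokenizer appears
    only when configured; the remaining fields are always emitted."""
    parts = [
        f"base_url={url}",
        f"model={model_id}",
        f"tokenized_requests={tokenized_requests}",
    ]
    if tokenizer is not None:
        parts.append(f"tokenizer={tokenizer}")
    parts += [
        f"tokenizer_backend={tokenizer_backend}",
        f"num_concurrent={parallelism}",
        f"timeout={request_timeout}",
        f"max_retries={max_retries}",
        f"stream={stream}",
    ]
    return ",".join(parts)

# Hypothetical endpoint, with the defaults from the configs on this page:
print(build_model_args("http://localhost:8000/v1", "my-model",
                       parallelism=10, request_timeout=30, max_retries=5))
```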

humaneval_instruct#

The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. The implementation is taken from Llama.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: humaneval_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: humaneval_instruct
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: humaneval_instruct
target:
  api_endpoint:
    stream: false

ifeval#

IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.” The focus is on the model’s adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: ifeval

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: ifeval
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: ifeval
target:
  api_endpoint:
    stream: false

m_mmlu_id_str_chat#

The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: m_mmlu_id_str_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: m_mmlu_id_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code
  supported_endpoint_types:
  - chat
  type: m_mmlu_id_str_chat
target:
  api_endpoint:
    stream: false

m_mmlu_id_str_completions#

The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: m_mmlu_id_str_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: m_mmlu_id_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: m_mmlu_id_str_completions
target:
  api_endpoint:
    stream: false

mbpp_plus_chat#

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mbpp_plus_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mbpp_plus
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --confirm_run_unsafe_code
  supported_endpoint_types:
  - chat
  type: mbpp_plus_chat
target:
  api_endpoint:
    stream: false
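The configs on this page request `temperature: 1.0e-07` and `top_p: 0.9999999` rather than literal zero and one; a tiny positive temperature keeps the request valid for endpoints that reject `temperature=0` while still collapsing sampling onto the most likely token. A minimal sketch of why, using hypothetical logits (not part of the harness itself):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# At temperature 1e-07 the distribution collapses onto the argmax token,
# so a "sampling" request behaves like greedy decoding.
near_greedy = softmax_with_temperature([2.0, 1.9, 0.5], 1.0e-07)
print(near_greedy[0])  # effectively 1.0; all other tokens get ~0 probability
```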

mbpp_plus_completions#

MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mbpp_plus_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mbpp_plus
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --confirm_run_unsafe_code
  supported_endpoint_types:
  - completions
  type: mbpp_plus_completions
target:
  api_endpoint:
    stream: false

mgsm#

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mgsm

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mgsm_direct
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mgsm
target:
  api_endpoint:
    stream: false

mgsm_cot_chat#

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the chat endpoint and defaults to chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mgsm_cot_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mgsm_cot_native
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: mgsm_cot_chat
target:
  api_endpoint:
    stream: false
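Chain-of-thought variants like this one generate free-form reasoning, so scoring depends on extracting a final answer from the completion. A hedged sketch of the common "take the last number" heuristic follows; the harness's real extraction filters are task-specific and may differ:

```python
import re

def extract_final_answer(completion):
    """Return the last number in a chain-of-thought completion, a common
    extraction heuristic for GSM8K/MGSM-style math tasks (sketch only)."""
    # Strip thousands separators so "1,000" is read as one number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

cot = "Roger starts with 5 balls. 2 cans of 3 is 6. 5 + 6 = 11. The answer is 11."
print(extract_final_answer(cot))  # "11"
```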

mgsm_cot_completions#

  • The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the completions endpoint and defaults to chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mgsm_cot_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mgsm_cot_native
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - completions
  type: mgsm_cot_completions
target:
  api_endpoint:
    stream: false

mmlu#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses text generation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: mmlu
target:
  api_endpoint:
    stream: false

mmlu_cot_0_shot_chat#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant defaults to zero-shot chain-of-thought evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_cot_0_shot_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_cot_0_shot_chat
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - chat
  type: mmlu_cot_0_shot_chat
target:
  api_endpoint:
    stream: false

mmlu_instruct#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the chat endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code --add_instruction
  supported_endpoint_types:
  - chat
  type: mmlu_instruct
target:
  api_endpoint:
    stream: false

mmlu_instruct_completions#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the completions endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_instruct_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --trust_remote_code --add_instruction
  supported_endpoint_types:
  - completions
  type: mmlu_instruct_completions
target:
  api_endpoint:
    stream: false

mmlu_logits#

  • The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant scores answer choices using the model's logits rather than generated text.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_logits

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: mmlu_logits
target:
  api_endpoint:
    stream: false
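Unlike the generative `mmlu_str` variants, the logits-based task scores each multiple-choice question by comparing the log-likelihood the model assigns to each answer continuation. A minimal sketch of that selection rule, with hypothetical summed log-probabilities standing in for what the endpoint would return:

```python
def pick_choice(choice_logprobs):
    """Log-likelihood multiple-choice scoring (sketch): the prediction is the
    choice whose answer continuation received the highest total log-prob."""
    # choice_logprobs maps a choice letter to the summed token log-probs
    # for that continuation; these numbers are illustrative only.
    return max(choice_logprobs, key=choice_logprobs.get)

scores = {"A": -4.2, "B": -1.3, "C": -6.8, "D": -5.0}
print(pick_choice(scores))  # "B": highest (least negative) log-likelihood
```

Accuracy is then the fraction of questions where the selected letter matches the gold answer, so no text generation or answer parsing is involved.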

mmlu_pro#

MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_pro

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: mmlu_pro
target:
  api_endpoint:
    stream: false

mmlu_pro_instruct#

  • MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_pro_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: mmlu_pro_instruct
target:
  api_endpoint:
    stream: false

mmlu_prox_chat#

MMLU-ProX is a multilingual benchmark for advanced large language model evaluation (chat endpoint).

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_completions
target:
  api_endpoint:
    stream: false
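
The `_chat` and `_completions` variants of each task differ only in the endpoint type, which the template maps to an lm-eval model backend plus, for chat, two extra flags. A hedged sketch of that branch (the helper name `endpoint_flags` is illustrative):

```python
def endpoint_flags(endpoint_type):
    """Mirror the template's endpoint handling: pick the lm-eval backend
    and the chat-only flags for the given api_endpoint type."""
    if endpoint_type == "completions":
        return "local-completions", []
    if endpoint_type == "chat":
        # Chat endpoints also get few-shot examples as multi-turn messages
        # and the model's chat template applied.
        return "local-chat-completions", ["--fewshot_as_multiturn",
                                          "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```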

mmlu_prox_de_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - German dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_de_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_de_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_de_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - German dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_de_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_de_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_es_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Spanish dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_es_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_es_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_es_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Spanish dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_es_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_es_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_fr_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - French dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_fr_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_fr_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_fr_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - French dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_fr_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_fr_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_it_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Italian dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_it_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_it_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_it_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Italian dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_it_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_it_completions
target:
  api_endpoint:
    stream: false

mmlu_prox_ja_chat#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Japanese dataset (chat endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_ja_chat

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_ja_chat
target:
  api_endpoint:
    stream: false

mmlu_prox_ja_completions#

MMLU-ProX: a multilingual benchmark for advanced large language model evaluation - Japanese dataset (completions endpoint)

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_prox_ja_completions

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_ja_completions
target:
  api_endpoint:
    stream: false
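In the command template above, the `--gen_kwargs` flag appears only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set. A minimal Python sketch of that assembly (the function name is illustrative, and the parts are comma-joined rather than reproducing the template's literal comma placement):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mimic the template's --gen_kwargs assembly: the flag is emitted
    only when at least one sampling parameter is set."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval calls the generation-length cap max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

# With this task's defaults (temperature 1.0e-07, top_p 0.9999999):
print(build_gen_kwargs(temperature=1e-07, top_p=0.9999999))
# → --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```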

mmlu_redux#

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_redux

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_redux
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_redux
target:
  api_endpoint:
    stream: false

mmlu_redux_instruct#

  • MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
  • This variant applies a chat template and defaults to zero-shot evaluation.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: mmlu_redux_instruct

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 8192
    max_retries: 5
    parallelism: 10
    task: mmlu_redux
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --add_instruction
  supported_endpoint_types:
  - chat
  type: mmlu_redux_instruct
target:
  api_endpoint:
    stream: false
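This instruct variant is restricted to chat endpoints. In the command template, the endpoint type selects both the lm-eval model backend and the chat-only flags; a hedged sketch of that branching (function name is illustrative):

```python
def endpoint_flags(endpoint_type):
    """Mirror the template's branching on target.api_endpoint.type:
    chat endpoints use the chat backend and additionally get
    --fewshot_as_multiturn and --apply_chat_template."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(endpoint_flags("chat"))
```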

musr#

The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: musr

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_musr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: musr
target:
  api_endpoint:
    stream: false

openbookqa#

  • OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding of a subject.
  • Answering OpenBookQA questions requires additional broad common knowledge not contained in the book.
  • The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: openbookqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: openbookqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: openbookqa
target:
  api_endpoint:
    stream: false

piqa#

  • Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: piqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: piqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: piqa
target:
  api_endpoint:
    stream: false

social_iqa#

  • Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: social_iqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: social_iqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: social_iqa
target:
  api_endpoint:
    stream: false

truthfulqa#

  • The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions.
  • It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: truthfulqa

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: truthfulqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: truthfulqa
target:
  api_endpoint:
    stream: false

wikilingua#

  • The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: wikilingua

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: wikilingua
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - chat
  type: wikilingua
target:
  api_endpoint:
    stream: false

wikitext#

  • The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
  • This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: wikitext

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: wikitext
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      args: --trust_remote_code
  supported_endpoint_types:
  - completions
  type: wikitext
target:
  api_endpoint:
    stream: false

winogrande#

WinoGrande is a collection of 44k problems formulated as a fill-in-the-blank task with binary options, testing commonsense reasoning.

Harness: lm-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01

Container Digest:

sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26

Container Arch: multiarch

Task Type: winogrande

{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: winogrande
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
  - completions
  type: winogrande
target:
  api_endpoint:
    stream: false
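Note that winogrande pins `num_fewshot: 5` under `extra`. In the command template, `--num_fewshot` is added only when the key is defined (Jinja's `is defined` test), so an explicit `0`, as in mmlu_redux_instruct, is still passed through while tasks that omit the key fall back to the harness default. A small illustrative sketch:

```python
def fewshot_flag(extra):
    """Emit --num_fewshot only when the key exists in config.params.extra,
    mirroring Jinja's 'is defined' test: an explicit 0 is still emitted."""
    if "num_fewshot" in extra:
        return ["--num_fewshot", str(extra["num_fewshot"])]
    return []

print(fewshot_flag({"num_fewshot": 5}))   # winogrande
print(fewshot_flag({}))                   # tasks using the harness default
```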