bigcode-evaluation-harness#

This page lists all evaluation tasks available in the bigcode-evaluation-harness.

| Task | Description |
| --- | --- |
| humaneval | HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. |
| humaneval_instruct | InstructHumanEval is a modified version of OpenAI HumanEval. For each problem, the function signature, docstring, and header are extracted from the original prompt to create a more flexible setting for evaluating instruction-tuned LLMs. The delimiters used during instruction tuning can be used to build an instruction that elicits the model's best capabilities. |
| humanevalplus | HumanEvalPlus is a modified version of HumanEval containing 80x more test cases. |
| mbpp-chat | MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the chat endpoint. |
| mbpp-completions | MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the completions endpoint. |
| mbppplus-chat | MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint. |
| mbppplus-completions | MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint. |
| mbppplus_nemo | MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template. |
| multiple-cpp | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cpp” subset. |
| multiple-cs | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cs” subset. |
| multiple-d | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “d” subset. |
| multiple-go | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “go” subset. |
| multiple-java | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “java” subset. |
| multiple-jl | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “jl” subset. |
| multiple-js | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “js” subset. |
| multiple-lua | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “lua” subset. |
| multiple-php | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “php” subset. |
| multiple-pl | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “pl” subset. |
| multiple-py | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “py” subset. |
| multiple-r | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “r” subset. |
| multiple-rb | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rb” subset. |
| multiple-rkt | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rkt” subset. |
| multiple-rs | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rs” subset. |
| multiple-scala | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “scala” subset. |
| multiple-sh | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “sh” subset. |
| multiple-swift | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “swift” subset. |
| multiple-ts | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “ts” subset. |
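The `multiple-*` suffixes are MultiPL-E's language codes. For convenience, a plausible mapping of each suffix to its target language (assembled here as a reference; the authoritative list is the MultiPL-E dataset itself):

```python
# Assumed mapping of MultiPL-E task suffixes to target languages.
MULTIPLE_SUBSETS = {
    "cpp": "C++", "cs": "C#", "d": "D", "go": "Go", "java": "Java",
    "jl": "Julia", "js": "JavaScript", "lua": "Lua", "php": "PHP",
    "pl": "Perl", "py": "Python", "r": "R", "rb": "Ruby", "rkt": "Racket",
    "rs": "Rust", "scala": "Scala", "sh": "Bash", "swift": "Swift",
    "ts": "TypeScript",
}
```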

humaneval#

HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: humaneval

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: humaneval
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 20
  supported_endpoint_types:
  - completions
  type: humaneval
target:
  api_endpoint: {}
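The `n_samples` setting above controls how many completions are generated per problem, from which pass@k is estimated. A minimal sketch of the standard unbiased pass@k estimator (illustrative, not the harness's exact code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k over n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `n_samples: 20` and one correct completion out of two draws, `pass_at_k(2, 1, 1)` yields 0.5.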

humaneval_instruct#

InstructHumanEval is a modified version of OpenAI HumanEval. For each problem, the function signature, docstring, and header are extracted from the original prompt to create a more flexible setting for evaluating instruction-tuned LLMs. The delimiters used during instruction tuning can be used to build an instruction that elicits the model's best capabilities.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: humaneval_instruct

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: instruct-humaneval-nocontext-py
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 20
  supported_endpoint_types:
  - chat
  type: humaneval_instruct
target:
  api_endpoint: {}
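As an illustration of the signature/docstring split described above, an instruction prompt might be assembled like this (the wording and layout are hypothetical, not the dataset's actual fields or delimiters):

```python
# Hypothetical prompt assembly; InstructHumanEval's real delimiters differ.
signature = "def add(a: int, b: int) -> int:"
docstring = "Return the sum of a and b."

instruction = (
    "Write a Python function matching this signature and docstring.\n"
    f"{signature}\n"
    f'    """{docstring}"""'
)
```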

humanevalplus#

HumanEvalPlus is a modified version of HumanEval containing 80x more test cases.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: humanevalplus

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: humanevalplus
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: humanevalplus
target:
  api_endpoint: {}
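The value of the extra test cases can be seen with a toy pair of solutions, loosely modeled on a HumanEval-style problem (illustrative only):

```python
def below_zero(ops):
    """Reference: True if the running balance ever drops below zero."""
    bal = 0
    for op in ops:
        bal += op
        if bal < 0:
            return True
    return False

def below_zero_buggy(ops):
    """Subtly wrong: only checks the final balance."""
    return sum(ops) < 0

# A sparse base test cannot tell the two apart...
assert below_zero([1, 2, -7]) is True
assert below_zero_buggy([1, 2, -7]) is True
# ...but an extra edge case (balance dips, then recovers) does.
assert below_zero([1, -3, 5]) is True
assert below_zero_buggy([1, -3, 5]) is False
```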

mbpp-chat#

MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the chat endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbpp-chat

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbpp
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 10
  supported_endpoint_types:
  - chat
  type: mbpp-chat
target:
  api_endpoint: {}
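The "description, solution, three tests" shape mentioned above looks roughly like this (a made-up item, not taken from the dataset):

```python
# Hypothetical MBPP-style item.
# Task description: "Write a function to find the minimum of three numbers."
def min_of_three(a, b, c):
    return min(a, b, c)

# The three automated test cases:
assert min_of_three(10, 20, 0) == 0
assert min_of_three(19, 15, 18) == 15
assert min_of_three(-10, -20, -30) == -30
```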

mbpp-completions#

MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the completions endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbpp-completions

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbpp
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 10
  supported_endpoint_types:
  - completions
  type: mbpp-completions
target:
  api_endpoint: {}
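The chat and completions variants differ only in which API shape they exercise. A sketch of the two OpenAI-style request bodies (field values are placeholders, and the harness may set additional fields):

```python
# Completions: the model continues a raw code prefix.
completions_request = {
    "model": "my-model",            # placeholder model id
    "prompt": "def add(a, b):\n",   # code prefix to complete
    "max_tokens": 2048,
    "temperature": 0.1,
    "top_p": 0.95,
}

# Chat: the problem is phrased as a user message.
chat_request = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Write a function add(a, b)."}],
    "max_tokens": 2048,
    "temperature": 0.1,
    "top_p": 0.95,
}
```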

mbppplus-chat#

MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbppplus-chat

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbppplus
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - chat
  type: mbppplus-chat
target:
  api_endpoint: {}
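In the configs above, `parallelism` feeds `--async_limit`, which caps the number of in-flight requests. A minimal sketch of that pattern using a semaphore (illustrative, not the harness's actual implementation):

```python
import asyncio

async def bounded_gather(coros, limit=10):
    # Cap concurrency with a semaphore; gather preserves result order.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

async def fake_request(i):
    # Stand-in for a single evaluation request.
    await asyncio.sleep(0)
    return i

results = asyncio.run(bounded_gather([fake_request(i) for i in range(5)], limit=2))
```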

mbppplus-completions#

MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbppplus-completions

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbppplus
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: mbppplus-completions
target:
  api_endpoint: {}

mbppplus_nemo#

MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbppplus_nemo

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbppplus_nemo
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - chat
  type: mbppplus_nemo
target:
  api_endpoint: {}
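The `request_timeout` and `max_retries` parameters above are forwarded to the client as `timeout` and `connection_retries` in `model_kwargs`. A generic retry-with-backoff sketch of that behavior (a hypothetical helper, not the harness's code):

```python
import time

def request_with_retries(send, retries=5, timeout=30, base_delay=1.0):
    """Retry a request callable up to `retries` times on connection errors,
    sleeping with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return send(timeout=timeout)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```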

multiple-cpp#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cpp” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-cpp

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-cpp
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-cpp
target:
  api_endpoint: {}

multiple-cs#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cs” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-cs

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-cs
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-cs
target:
  api_endpoint: {}

multiple-d#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “d” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-d

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-d
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-d
target:
  api_endpoint: {}

multiple-go#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “go” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-go

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-go
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-go
target:
  api_endpoint: {}

multiple-java#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “java” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-java

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-java
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-java
target:
  api_endpoint: {}

multiple-jl#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “jl” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-jl

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-jl
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-jl
target:
  api_endpoint: {}

multiple-js#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “js” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-js

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-js
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-js
target:
  api_endpoint: {}

multiple-lua#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “lua” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-lua

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-lua
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-lua
target:
  api_endpoint: {}
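Each of these configs draws `n_samples: 5` completions per problem; code-generation harnesses in this family conventionally summarize functional correctness as pass@k, using the unbiased estimator of Chen et al. (2021) over n samples of which c pass. A minimal sketch (the function name is illustrative, not a harness API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c passing), passes."""
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n_samples=5: one passing generation out of five gives pass@1 of 1/5.
print(pass_at_k(5, 1, 1))  # ≈ 0.2
```

This is why `n_samples` is paired with sampling settings (`do_sample: true`, `temperature: 0.1`): pass@k for k > 1 is only meaningful when the generations differ.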

multiple-php#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “php” (PHP) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-php

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-php
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-php
target:
  api_endpoint: {}

multiple-pl#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “pl” (Perl) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-pl

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-pl
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-pl
target:
  api_endpoint: {}

multiple-py#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “py” (Python) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-py

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-py
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-py
target:
  api_endpoint: {}

multiple-r#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “r” (R) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-r

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-r
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-r
target:
  api_endpoint: {}

multiple-rb#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rb” (Ruby) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-rb

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-rb
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-rb
target:
  api_endpoint: {}

multiple-rkt#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rkt” (Racket) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-rkt

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-rkt
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-rkt
target:
  api_endpoint: {}

multiple-rs#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rs” (Rust) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-rs

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-rs
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-rs
target:
  api_endpoint: {}

multiple-scala#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “scala” (Scala) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-scala

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-scala
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-scala
target:
  api_endpoint: {}

multiple-sh#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “sh” (Bash shell) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-sh

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-sh
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-sh
target:
  api_endpoint: {}

multiple-swift#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “swift” (Swift) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-swift

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-swift
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-swift
target:
  api_endpoint: {}

multiple-ts#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “ts” (TypeScript) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-ts

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-ts
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-ts
target:
  api_endpoint: {}