bigcode-evaluation-harness#

This page lists all evaluation tasks available in the bigcode-evaluation-harness.

| Task | Description |
| --- | --- |
| humaneval | HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. |
| humaneval_instruct | InstructHumanEval is a modified version of OpenAI HumanEval. For each problem, the function signature, docstring, and header are extracted from the original prompt to create a more flexible setting for evaluating instruction-tuned LLMs. The delimiters used during instruction tuning can be used to build an instruction that elicits the model's best capabilities. |
| humanevalplus | HumanEvalPlus is a modified version of HumanEval containing 80x more test cases. |
| mbpp-chat | MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the chat endpoint. |
| mbpp-completions | MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the completions endpoint. |
| mbppplus-chat | MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint. |
| mbppplus-completions | MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint. |
| mbppplus_nemo | MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template. |
| multiple-cpp | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cpp” subset. |
| multiple-cs | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cs” subset. |
| multiple-d | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “d” subset. |
| multiple-go | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “go” subset. |
| multiple-java | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “java” subset. |
| multiple-jl | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “jl” subset. |
| multiple-js | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “js” subset. |
| multiple-lua | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “lua” subset. |
| multiple-php | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “php” subset. |
| multiple-pl | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “pl” subset. |
| multiple-py | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “py” subset. |
| multiple-r | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “r” subset. |
| multiple-rb | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rb” subset. |
| multiple-rkt | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rkt” subset. |
| multiple-rs | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rs” subset. |
| multiple-scala | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “scala” subset. |
| multiple-sh | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “sh” subset. |
| multiple-swift | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “swift” subset. |
| multiple-ts | MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “ts” subset. |
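The `multiple-*` suffixes are MultiPL-E's language codes. For convenience, a plausible mapping of each suffix to its target language (assembled here as a reference; the authoritative list is the MultiPL-E dataset itself):

```python
# Assumed mapping of MultiPL-E task suffixes to target languages.
MULTIPLE_SUBSETS = {
    "cpp": "C++", "cs": "C#", "d": "D", "go": "Go", "java": "Java",
    "jl": "Julia", "js": "JavaScript", "lua": "Lua", "php": "PHP",
    "pl": "Perl", "py": "Python", "r": "R", "rb": "Ruby", "rkt": "Racket",
    "rs": "Rust", "scala": "Scala", "sh": "Bash", "swift": "Swift",
    "ts": "TypeScript",
}
```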

humaneval#

HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: humaneval

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: humaneval
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 20
  supported_endpoint_types:
  - completions
  type: humaneval
target:
  api_endpoint: {}
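The `n_samples` setting above controls how many completions are generated per problem, from which pass@k is estimated. A minimal sketch of the standard unbiased pass@k estimator (illustrative, not the harness's exact code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k over n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `n_samples: 20` and one correct completion out of two draws, `pass_at_k(2, 1, 1)` yields 0.5.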

humaneval_instruct#

InstructHumanEval is a modified version of OpenAI HumanEval. For each problem, the function signature, docstring, and header are extracted from the original prompt to create a more flexible setting for evaluating instruction-tuned LLMs. The delimiters used during instruction tuning can be used to build an instruction that elicits the model's best capabilities.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: humaneval_instruct

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: instruct-humaneval-nocontext-py
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 20
  supported_endpoint_types:
  - chat
  type: humaneval_instruct
target:
  api_endpoint: {}
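As an illustration of the signature/docstring split described above, an instruction prompt might be assembled like this (the wording and layout are hypothetical, not the dataset's actual fields or delimiters):

```python
# Hypothetical prompt assembly; InstructHumanEval's real delimiters differ.
signature = "def add(a: int, b: int) -> int:"
docstring = "Return the sum of a and b."

instruction = (
    "Write a Python function matching this signature and docstring.\n"
    f"{signature}\n"
    f'    """{docstring}"""'
)
```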

humanevalplus#

HumanEvalPlus is a modified version of HumanEval containing 80x more test cases.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: humanevalplus

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: humanevalplus
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: humanevalplus
target:
  api_endpoint: {}
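The value of the extra test cases can be seen with a toy pair of solutions, loosely modeled on a HumanEval-style problem (illustrative only):

```python
def below_zero(ops):
    """Reference: True if the running balance ever drops below zero."""
    bal = 0
    for op in ops:
        bal += op
        if bal < 0:
            return True
    return False

def below_zero_buggy(ops):
    """Subtly wrong: only checks the final balance."""
    return sum(ops) < 0

# A sparse base test cannot tell the two apart...
assert below_zero([1, 2, -7]) is True
assert below_zero_buggy([1, 2, -7]) is True
# ...but an extra edge case (balance dips, then recovers) does.
assert below_zero([1, -3, 5]) is True
assert below_zero_buggy([1, -3, 5]) is False
```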

mbpp-chat#

MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the chat endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbpp-chat

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbpp
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 10
  supported_endpoint_types:
  - chat
  type: mbpp-chat
target:
  api_endpoint: {}
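The "description, solution, three tests" shape mentioned above looks roughly like this (a made-up item, not taken from the dataset):

```python
# Hypothetical MBPP-style item.
# Task description: "Write a function to find the minimum of three numbers."
def min_of_three(a, b, c):
    return min(a, b, c)

# The three automated test cases:
assert min_of_three(10, 20, 0) == 0
assert min_of_three(19, 15, 18) == 15
assert min_of_three(-10, -20, -30) == -30
```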

mbpp-completions#

MBPP consists of Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the completions endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbpp-completions

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbpp
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 10
  supported_endpoint_types:
  - completions
  type: mbpp-completions
target:
  api_endpoint: {}
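The chat and completions variants differ only in which API shape they exercise. A sketch of the two OpenAI-style request bodies (field values are placeholders, and the harness may set additional fields):

```python
# Completions: the model continues a raw code prefix.
completions_request = {
    "model": "my-model",            # placeholder model id
    "prompt": "def add(a, b):\n",   # code prefix to complete
    "max_tokens": 2048,
    "temperature": 0.1,
    "top_p": 0.95,
}

# Chat: the problem is phrased as a user message.
chat_request = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Write a function add(a, b)."}],
    "max_tokens": 2048,
    "temperature": 0.1,
    "top_p": 0.95,
}
```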

mbppplus-chat#

MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbppplus-chat

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbppplus
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - chat
  type: mbppplus-chat
target:
  api_endpoint: {}
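In the configs above, `parallelism` feeds `--async_limit`, which caps the number of in-flight requests. A minimal sketch of that pattern using a semaphore (illustrative, not the harness's actual implementation):

```python
import asyncio

async def bounded_gather(coros, limit=10):
    # Cap concurrency with a semaphore; gather preserves result order.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

async def fake_request(i):
    # Stand-in for a single evaluation request.
    await asyncio.sleep(0)
    return i

results = asyncio.run(bounded_gather([fake_request(i) for i in range(5)], limit=2))
```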

mbppplus-completions#

MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbppplus-completions

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbppplus
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: mbppplus-completions
target:
  api_endpoint: {}

mbppplus_nemo#

MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: mbppplus_nemo

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 10
    task: mbppplus_nemo
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - chat
  type: mbppplus_nemo
target:
  api_endpoint: {}
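The `request_timeout` and `max_retries` parameters above are forwarded to the client as `timeout` and `connection_retries` in `model_kwargs`. A generic retry-with-backoff sketch of that behavior (a hypothetical helper, not the harness's code):

```python
import time

def request_with_retries(send, retries=5, timeout=30, base_delay=1.0):
    """Retry a request callable up to `retries` times on connection errors,
    sleeping with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return send(timeout=timeout)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```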

multiple-cpp#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cpp” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-cpp

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-cpp
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-cpp
target:
  api_endpoint: {}

multiple-cs#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cs” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-cs

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-cs
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-cs
target:
  api_endpoint: {}

multiple-d#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “d” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-d

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-d
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-d
target:
  api_endpoint: {}

multiple-go#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “go” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-go

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-go
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-go
target:
  api_endpoint: {}

multiple-java#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “java” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-java

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-java
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-java
target:
  api_endpoint: {}

multiple-jl#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “jl” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-jl

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-jl
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-jl
target:
  api_endpoint: {}

multiple-js#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “js” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-js

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-js
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-js
target:
  api_endpoint: {}

multiple-lua#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “lua” subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-lua

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-lua
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-lua
target:
  api_endpoint: {}
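Each of these configs draws `n_samples: 5` completions per problem; code-generation harnesses in this family conventionally summarize functional correctness as pass@k, using the unbiased estimator of Chen et al. (2021) over n samples of which c pass. A minimal sketch (the function name is illustrative, not a harness API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c passing), passes."""
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n_samples=5: one passing generation out of five gives pass@1 of 1/5.
print(pass_at_k(5, 1, 1))  # ≈ 0.2
```

This is why `n_samples` is paired with sampling settings (`do_sample: true`, `temperature: 0.1`): pass@k for k > 1 is only meaningful when the generations differ.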

multiple-php#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “php” (PHP) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-php

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-php
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-php
target:
  api_endpoint: {}

multiple-pl#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “pl” (Perl) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-pl

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-pl
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-pl
target:
  api_endpoint: {}

multiple-py#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “py” (Python) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-py

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-py
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-py
target:
  api_endpoint: {}

multiple-r#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “r” (R) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-r

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-r
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-r
target:
  api_endpoint: {}

multiple-rb#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rb” (Ruby) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-rb

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-rb
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-rb
target:
  api_endpoint: {}

multiple-rkt#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rkt” (Racket) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-rkt

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-rkt
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-rkt
target:
  api_endpoint: {}

multiple-rs#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rs” (Rust) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-rs

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-rs
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-rs
target:
  api_endpoint: {}

multiple-scala#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “scala” (Scala) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-scala

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-scala
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-scala
target:
  api_endpoint: {}

multiple-sh#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “sh” (Bash shell) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-sh

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-sh
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-sh
target:
  api_endpoint: {}

multiple-swift#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “swift” (Swift) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-swift

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-swift
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-swift
target:
  api_endpoint: {}

multiple-ts#

MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “ts” (TypeScript) subset.

Harness: bigcode-evaluation-harness

Container:

nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01

Container Digest:

sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd

Container Arch: multiarch

Task Type: multiple-ts

{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
  params:
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: multiple-ts
    temperature: 0.1
    request_timeout: 30
    top_p: 0.95
    extra:
      do_sample: true
      n_samples: 5
  supported_endpoint_types:
  - completions
  type: multiple-ts
target:
  api_endpoint: {}