bigcode-evaluation-harness#
This page contains all evaluation tasks for the bigcode-evaluation-harness harness.
Task |
Description |
|---|---|
HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. |
|
InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring as well as its header to create a flexing setting which would allow to evaluation instruction-tuned LLM. The delimiters used in the instruction-tuning procedure can be use to build and instruction that would allow the model to elicit its best capabilities. |
|
HumanEvalPlus is a modified version of HumanEval containing 80x more test cases. |
|
MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the chat endpoint. |
|
MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the completions endpoint. |
|
MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint. |
|
MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint. |
|
MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cpp” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cs” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “d” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “go” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “java” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “jl” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “js” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “lua” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “php” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “pl” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “py” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “r” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rb” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rkt” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rs” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “scala” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “sh” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “swift” subset. |
|
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “ts” subset. |
humaneval#
HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: humaneval
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: humaneval
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 20
supported_endpoint_types:
- completions
type: humaneval
target:
api_endpoint: {}
humaneval_instruct#
InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring as well as its header to create a flexing setting which would allow to evaluation instruction-tuned LLM. The delimiters used in the instruction-tuning procedure can be use to build and instruction that would allow the model to elicit its best capabilities.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: humaneval_instruct
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: instruct-humaneval-nocontext-py
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 20
supported_endpoint_types:
- chat
type: humaneval_instruct
target:
api_endpoint: {}
humanevalplus#
HumanEvalPlus is a modified version of HumanEval containing 80x more test cases.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: humanevalplus
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: humanevalplus
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: humanevalplus
target:
api_endpoint: {}
mbpp-chat#
MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the chat endpoint.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: mbpp-chat
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbpp
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 10
supported_endpoint_types:
- chat
type: mbpp-chat
target:
api_endpoint: {}
mbpp-completions#
MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the completions endpoint.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: mbpp-completions
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbpp
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 10
supported_endpoint_types:
- completions
type: mbpp-completions
target:
api_endpoint: {}
mbppplus-chat#
MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: mbppplus-chat
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbppplus
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- chat
type: mbppplus-chat
target:
api_endpoint: {}
mbppplus-completions#
MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: mbppplus-completions
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbppplus
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: mbppplus-completions
target:
api_endpoint: {}
mbppplus_nemo#
MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: mbppplus_nemo
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbppplus_nemo
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- chat
type: mbppplus_nemo
target:
api_endpoint: {}
multiple-cpp#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cpp” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-cpp
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-cpp
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-cpp
target:
api_endpoint: {}
multiple-cs#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “cs” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-cs
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-cs
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-cs
target:
api_endpoint: {}
multiple-d#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “d” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-d
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-d
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-d
target:
api_endpoint: {}
multiple-go#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “go” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-go
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-go
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-go
target:
api_endpoint: {}
multiple-java#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “java” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-java
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-java
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-java
target:
api_endpoint: {}
multiple-jl#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “jl” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-jl
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-jl
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-jl
target:
api_endpoint: {}
multiple-js#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “js” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-js
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-js
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-js
target:
api_endpoint: {}
multiple-lua#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “lua” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-lua
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-lua
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-lua
target:
api_endpoint: {}
multiple-php#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “php” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-php
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-php
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-php
target:
api_endpoint: {}
multiple-pl#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “pl” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-pl
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-pl
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-pl
target:
api_endpoint: {}
multiple-py#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “py” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-py
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-py
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-py
target:
api_endpoint: {}
multiple-r#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “r” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-r
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-r
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-r
target:
api_endpoint: {}
multiple-rb#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rb” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-rb
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-rb
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-rb
target:
api_endpoint: {}
multiple-rkt#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rkt” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-rkt
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-rkt
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-rkt
target:
api_endpoint: {}
multiple-rs#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “rs” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-rs
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-rs
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-rs
target:
api_endpoint: {}
multiple-scala#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “scala” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-scala
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-scala
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-scala
target:
api_endpoint: {}
multiple-sh#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “sh” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-sh
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-sh
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-sh
target:
api_endpoint: {}
multiple-swift#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “swift” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-swift
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-swift
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-swift
target:
api_endpoint: {}
multiple-ts#
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the “ts” subset.
Harness: bigcode-evaluation-harness
Container:
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
Container Digest:
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
Container Arch: multiarch
Task Type: multiple-ts
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-ts
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-ts
target:
api_endpoint: {}