livecodebench#
This page contains all evaluation tasks for the livecodebench harness.
| Task | Description |
|---|---|
| codeexecution_v2 | “Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. |
| codeexecution_v2_cot | Chain-of-Thought version of the code execution task. |
| codegeneration_notfast | Code generation (v2) run with the slower `--not_fast` evaluation mode. |
| codegeneration_release_latest | Code generation on the latest dataset release. |
| codegeneration_release_v1 | The initial release of the dataset (v1), containing 400 problems released between May 2023 and Mar 2024. |
| codegeneration_release_v2 | The updated release (v2), containing 511 problems released between May 2023 and May 2024. |
| codegeneration_release_v3 | The updated release (v3), containing 612 problems released between May 2023 and Jul 2024. |
| codegeneration_release_v4 | The updated release (v4), containing 713 problems released between May 2023 and Sep 2024. |
| codegeneration_release_v5 | The updated release (v5), containing 880 problems released between May 2023 and Jan 2025. |
| codegeneration_release_v6 | The updated release (v6), containing 1055 problems released between May 2023 and Apr 2025. |
| livecodebench_0724_0125 | Code generation using the data period (Jul 2024 to Jan 2025) and sampling parameters from Artificial Analysis. |
| livecodebench_0824_0225 | Code generation using the data period (Aug 2024 to Feb 2025) and sampling parameters from the NeMo Alignment team. |
| livecodebench_aa_v2 | Code generation using the Artificial Analysis data period and sampling parameters, with larger token and retry budgets. |
| testoutputprediction | Solve the natural language task on a specified input, evaluating the ability to generate testing outputs. The model is given the natural language problem description and an input, and the output should be the output for the problem. |
codeexecution_v2#
“Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codeexecution_v2
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codeexecution
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v2
supported_endpoint_types:
- chat
type: codeexecution_v2
target:
api_endpoint: {}
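The command template above is Jinja2: each `{{path.to.value}}` placeholder is filled from the `target` and `config` sections before the `livecodebench` CLI is invoked. As a rough sketch of that substitution (a minimal stand-in for Jinja2 that handles only plain dotted placeholders, not the `{% if %}` blocks; the model ID and URL below are hypothetical placeholders):

```python
import re

def render(template: str, ctx: dict) -> str:
    """Substitute {{dotted.path}} placeholders from a nested dict.

    Simplified stand-in for Jinja2 rendering; conditionals are not handled.
    """
    def lookup(match: re.Match) -> str:
        value = ctx
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

# Hypothetical values; model_id and url are placeholders, not real endpoints.
ctx = {
    "target": {"api_endpoint": {"model_id": "my-model",
                                "url": "http://localhost:8000/v1"}},
    "config": {"params": {"task": "codeexecution", "temperature": 0.0}},
}
cmd = render(
    "livecodebench --model {{target.api_endpoint.model_id}} "
    "--scenario {{config.params.task}} "
    "--url {{target.api_endpoint.url}} "
    "--temperature {{config.params.temperature}}",
    ctx,
)
print(cmd)
```

The real harness renders the full template, including the conditional `--start_date`, `--end_date`, and `--cot_code_execution` flags, with Jinja2 itself.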
codeexecution_v2_cot#
Chain-of-Thought version of the code execution task: the model is given a program and an input and, after reasoning step by step, should produce the result, evaluating code comprehension ability.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codeexecution_v2_cot
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codeexecution
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: true
release_version: release_v2
supported_endpoint_types:
- chat
type: codeexecution_v2_cot
target:
api_endpoint: {}
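The defaults above can typically be overridden per run; the exact override mechanism depends on your runner, but the shape of the override follows the config block. A hypothetical override that shortens a run for smoke testing might look like:

```yaml
config:
  params:
    limit_samples: 20        # rendered as --first_n 20
    max_new_tokens: 8192     # more room for chain-of-thought output
    extra:
      n_samples: 1           # one completion per problem
```

Note that `limit_samples` only appears in the command template when it is set, since the template guards it with an `is not none` check.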
codegeneration_notfast#
Code generation (v2) run with the `--not_fast` flag, i.e., the slower, full evaluation mode.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_notfast
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
args: --not_fast
supported_endpoint_types:
- chat
type: codegeneration_notfast
target:
api_endpoint: {}
codegeneration_release_latest#
Code generation on the latest dataset release (release_latest).
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_latest
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_latest
supported_endpoint_types:
- chat
type: codegeneration_release_latest
target:
api_endpoint: {}
codegeneration_release_v1#
The initial release of the dataset (v1), containing 400 problems released between May 2023 and Mar 2024.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_v1
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v1
supported_endpoint_types:
- chat
type: codegeneration_release_v1
target:
api_endpoint: {}
codegeneration_release_v2#
The updated release of the dataset (v2), containing 511 problems released between May 2023 and May 2024.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_v2
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v2
supported_endpoint_types:
- chat
type: codegeneration_release_v2
target:
api_endpoint: {}
codegeneration_release_v3#
The updated release of the dataset (v3), containing 612 problems released between May 2023 and Jul 2024.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_v3
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v3
supported_endpoint_types:
- chat
type: codegeneration_release_v3
target:
api_endpoint: {}
codegeneration_release_v4#
The updated release of the dataset (v4), containing 713 problems released between May 2023 and Sep 2024.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_v4
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v4
supported_endpoint_types:
- chat
type: codegeneration_release_v4
target:
api_endpoint: {}
codegeneration_release_v5#
The updated release of the dataset (v5), containing 880 problems released between May 2023 and Jan 2025.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_v5
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: codegeneration_release_v5
target:
api_endpoint: {}
codegeneration_release_v6#
The updated release of the dataset (v6), containing 1055 problems released between May 2023 and Apr 2025.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: codegeneration_release_v6
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v6
supported_endpoint_types:
- chat
type: codegeneration_release_v6
target:
api_endpoint: {}
livecodebench_0724_0125#
Code generation using the data period (Jul 2024 to Jan 2025) and sampling parameters from Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: livecodebench_0724_0125
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 3
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: 2024-07-01
end_date: 2025-01-01
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: livecodebench_0724_0125
target:
api_endpoint: {}
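The `start_date` and `end_date` parameters restrict evaluation to problems released inside the given window, which is how this task pins the Jul 2024 to Jan 2025 period. A sketch of that filtering (assuming the bounds are compared against each problem's contest date; the exact inclusivity of the endpoints is the harness's choice):

```python
from datetime import date

# Hypothetical problem records; only the contest_date field matters here.
problems = [
    {"id": "june",    "contest_date": date(2024, 6, 15)},
    {"id": "october", "contest_date": date(2024, 10, 3)},
    {"id": "march",   "contest_date": date(2025, 3, 9)},
]

start = date(2024, 7, 1)   # --start_date 2024-07-01
end = date(2025, 1, 1)     # --end_date 2025-01-01

# Keep only problems whose contest date falls inside the window.
kept = [p["id"] for p in problems if start <= p["contest_date"] <= end]
print(kept)  # only the October problem falls in the window
```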
livecodebench_0824_0225#
Code generation using the data period (Aug 2024 to Feb 2025) and sampling parameters from the NeMo Alignment team.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: livecodebench_0824_0225
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 3
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: 2024-08-01
end_date: 2025-02-01
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: livecodebench_0824_0225
target:
api_endpoint: {}
livecodebench_aa_v2#
Code generation using the data period (Jul 2024 to Jan 2025) and sampling parameters from Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking); this variant raises max_new_tokens and max_retries.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: livecodebench_aa_v2
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 16384
max_retries: 30
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 3
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: 2024-07-01
end_date: 2025-01-01
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: livecodebench_aa_v2
target:
api_endpoint: {}
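With `n_samples` greater than 1, code-generation results are conventionally reported as pass@k. The standard unbiased estimator from the HumanEval paper is shown below as a reference point (whether this harness uses exactly this formula internally is an assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    draws from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n_samples = 3, as in this task's config:
print(pass_at_k(3, 2, 1))  # 2 of 3 samples correct -> pass@1 = 2/3
```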
testoutputprediction#
Solve the natural language task on a specified input, evaluating the ability to generate testing outputs. The model is given the natural language problem description and an input, and the output should be the output for the problem.
Harness: livecodebench
Container:
nvcr.io/nvidia/eval-factory/livecodebench:26.01
Container Digest:
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
Container Arch: multiarch
Task Type: testoutputprediction
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: testoutputprediction
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_latest
supported_endpoint_types:
- chat
type: testoutputprediction
target:
api_endpoint: {}
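All tasks on this page share the same target shape: an `api_endpoint` pointing at an OpenAI-compatible chat endpoint. A hypothetical filled-in target (model name, URL, and key-variable name are placeholders):

```yaml
target:
  api_endpoint:
    model_id: my-model               # placeholder model name
    url: http://localhost:8000/v1    # placeholder OpenAI-compatible endpoint
    api_key_name: MY_API_KEY         # env var the template exports as API_KEY
```

When `api_key_name` is unset, the templates skip the `export API_KEY=...` step entirely.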