livecodebench#

This page contains all evaluation tasks for the livecodebench harness.

| Task | Description |
| --- | --- |
| codeexecution_v2 | "Execute" a program on a given input to evaluate code comprehension: the model is shown a program and an input and must produce the program's output. |
| codeexecution_v2_cot | Chain-of-Thought version of codeexecution_v2: the model reasons step by step before producing the program's output. |
| codegeneration_notfast | Code generation (v2) run with the `--not_fast` option. |
| codegeneration_release_latest | Code generation on the latest dataset release. |
| codegeneration_release_v1 | The initial dataset release (v1): 400 problems released between May 2023 and Mar 2024. |
| codegeneration_release_v2 | Updated dataset release (v2): 511 problems released between May 2023 and May 2024. |
| codegeneration_release_v3 | Updated dataset release (v3): 612 problems released between May 2023 and Jul 2024. |
| codegeneration_release_v4 | Updated dataset release (v4): 713 problems released between May 2023 and Sep 2024. |
| codegeneration_release_v5 | Updated dataset release (v5): 880 problems released between May 2023 and Jan 2025. |
| codegeneration_release_v6 | Updated dataset release (v6): 1055 problems released between May 2023 and Apr 2025. |
| livecodebench_0724_0125 | Code generation using the data period (Jul 2024 to Jan 2025) and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking). |
| livecodebench_0824_0225 | Code generation using the data period (Aug 2024 to Feb 2025) and sampling parameters used by the NeMo Alignment team. |
| livecodebench_aa_v2 | Code generation using the data period and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking). |
| testoutputprediction | Solve a natural-language problem on a specified input to evaluate test-output generation: the model is given the problem description and an input and must produce the expected output. |
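Every task below is launched from essentially the same command template; the Jinja conditionals only emit optional flags (for example, `--start_date` is omitted when `extra.start_date` is null). The following Python sketch mirrors that flag-assembly logic for illustration; the `build_command` helper is hypothetical and not part of the livecodebench package:

```python
# Sketch of how the shared command template maps config params to CLI flags.
# Parameter names mirror the YAML configs on this page.

def build_command(params, extra, model_id, url, out_dir):
    args = [
        "livecodebench",
        "--model", model_id,
        "--scenario", params["task"],
        "--release_version", extra["release_version"],
        "--url", url,
        "--temperature", str(params["temperature"]),
        "--top_p", str(params["top_p"]),
        "--evaluate",
        "--n", str(extra["n_samples"]),
        "--max_tokens", str(params["max_new_tokens"]),
        "--out_dir", out_dir,
    ]
    # Optional flags, mirroring the {% if %} blocks in the template.
    if extra.get("start_date") is not None:
        args += ["--start_date", extra["start_date"]]
    if extra.get("end_date") is not None:
        args += ["--end_date", extra["end_date"]]
    if extra.get("support_system_role"):
        args.append("--support_system_role")
    if extra.get("cot_code_execution"):
        args.append("--cot_code_execution")
    return args

# Defaults from the codeexecution_v2 config below.
params = {"task": "codeexecution", "temperature": 0.0, "top_p": 1.0e-05,
          "max_new_tokens": 4096}
extra = {"release_version": "release_v2", "n_samples": 10,
         "support_system_role": False, "start_date": None, "end_date": None,
         "cot_code_execution": False}
cmd = build_command(params, extra, "my-model", "http://localhost:8000", "/tmp/out")
```

With the null dates and false booleans above, none of the optional flags appear in `cmd`.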

codeexecution_v2#

"Execute" a program on a given input to evaluate code comprehension: the model is shown a program and an input and must produce the program's output.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codeexecution_v2

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codeexecution
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v2
  supported_endpoint_types:
  - chat
  type: codeexecution_v2
target:
  api_endpoint: {}

codeexecution_v2_cot#

Chain-of-Thought version of codeexecution_v2: "execute" a program on a given input to evaluate code comprehension, with the model reasoning step by step before producing the program's output.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codeexecution_v2_cot

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codeexecution
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: true
      release_version: release_v2
  supported_endpoint_types:
  - chat
  type: codeexecution_v2_cot
target:
  api_endpoint: {}

codegeneration_notfast#

Code generation (v2) run with the `--not_fast` option.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_notfast

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      args: --not_fast
  supported_endpoint_types:
  - chat
  type: codegeneration_notfast
target:
  api_endpoint: {}
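The only difference from the plain code-generation tasks is `extra.args` (here `--not_fast`), which the final `{% if config.params.extra.args is defined %}` block splices onto the command line verbatim. A small sketch of that passthrough, using `shlex` for shell-style splitting; the `append_extra_args` helper is illustrative, not harness code:

```python
import shlex

def append_extra_args(cmd, extra):
    # Mirrors the template's trailing conditional: extra.args, if defined,
    # is appended to the command line as-is.
    if "args" in extra:
        return cmd + shlex.split(extra["args"])
    return cmd

base = ["livecodebench", "--scenario", "codegeneration"]
cmd = append_extra_args(base, {"args": "--not_fast"})
```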

codegeneration_release_latest#

Code generation on the latest dataset release.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_latest

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_latest
  supported_endpoint_types:
  - chat
  type: codegeneration_release_latest
target:
  api_endpoint: {}

codegeneration_release_v1#

The initial dataset release (v1): 400 problems released between May 2023 and Mar 2024.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_v1

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v1
  supported_endpoint_types:
  - chat
  type: codegeneration_release_v1
target:
  api_endpoint: {}

codegeneration_release_v2#

Updated dataset release (v2): 511 problems released between May 2023 and May 2024.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_v2

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v2
  supported_endpoint_types:
  - chat
  type: codegeneration_release_v2
target:
  api_endpoint: {}
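The code-generation configs request multiple completions per problem (`n_samples: 10`, passed as `--codegen_n`/`--n`), which is how pass@k-style metrics are usually computed. Assuming the harness summarizes results with the standard unbiased pass@k estimator (Chen et al., 2021), the calculation looks like this; this is a sketch of the general formula, not code taken from the livecodebench package:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples of which c are correct,
    passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 of them correct:
p1 = pass_at_k(10, 3, 1)  # pass@1 = 1 - 7/10 = 0.3
p5 = pass_at_k(10, 3, 5)
```

Averaging `pass_at_k` over all problems gives the benchmark score for a given k.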

codegeneration_release_v3#

Updated dataset release (v3): 612 problems released between May 2023 and Jul 2024.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_v3

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v3
  supported_endpoint_types:
  - chat
  type: codegeneration_release_v3
target:
  api_endpoint: {}

codegeneration_release_v4#

Updated dataset release (v4): 713 problems released between May 2023 and Sep 2024.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_v4

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v4
  supported_endpoint_types:
  - chat
  type: codegeneration_release_v4
target:
  api_endpoint: {}

codegeneration_release_v5#

Updated dataset release (v5): 880 problems released between May 2023 and Jan 2025.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_v5

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v5
  supported_endpoint_types:
  - chat
  type: codegeneration_release_v5
target:
  api_endpoint: {}

codegeneration_release_v6#

Updated dataset release (v6): 1055 problems released between May 2023 and Apr 2025.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: codegeneration_release_v6

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_v6
  supported_endpoint_types:
  - chat
  type: codegeneration_release_v6
target:
  api_endpoint: {}

livecodebench_0724_0125#

Code generation using the data period (Jul 2024 to Jan 2025) and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: livecodebench_0724_0125

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: 2024-07-01
      end_date: 2025-01-01
      cot_code_execution: false
      release_version: release_v5
  supported_endpoint_types:
  - chat
  type: livecodebench_0724_0125
target:
  api_endpoint: {}
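Here `start_date: 2024-07-01` and `end_date: 2025-01-01` restrict release_v5 to problems published in that window, which is how the task pins a fixed data period. A sketch of such a date filter; whether the real harness treats the bounds as inclusive or exclusive is an assumption (a half-open window is used below), and the helper is illustrative only:

```python
from datetime import date

def in_window(problem_date, start=None, end=None):
    # Keep a problem only if its release date falls inside the
    # half-open window [start, end); None means unbounded on that side.
    if start is not None and problem_date < start:
        return False
    if end is not None and problem_date >= end:
        return False
    return True

window = (date(2024, 7, 1), date(2025, 1, 1))
kept = in_window(date(2024, 9, 15), *window)     # inside the window
dropped = in_window(date(2024, 5, 2), *window)   # before start_date
```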

livecodebench_0824_0225#

Code generation using the data period (Aug 2024 to Feb 2025) and sampling parameters used by the NeMo Alignment team.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: livecodebench_0824_0225

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: 2024-08-01
      end_date: 2025-02-01
      cot_code_execution: false
      release_version: release_v5
  supported_endpoint_types:
  - chat
  type: livecodebench_0824_0225
target:
  api_endpoint: {}
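This config differs from `livecodebench_0724_0125` mainly in its data window, which is shifted by one month (2024-08-01 through 2025-02-01). An illustrative stdlib sketch of filtering problems by release date; note that treating the window as half-open (`end_date` exclusive) is an assumption here, not documented harness behavior:

```python
from datetime import date

def in_window(release_date, start, end):
    # Half-open window [start, end); whether the harness treats
    # end_date inclusively or exclusively is an assumption.
    return start <= release_date < end

print(in_window(date(2024, 12, 15), date(2024, 8, 1), date(2025, 2, 1)))
```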

livecodebench_aa_v2#

  • Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result.

  • The data period and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: livecodebench_aa_v2

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: codegeneration
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: 2024-07-01
      end_date: 2025-01-01
      cot_code_execution: false
      release_version: release_v5
  supported_endpoint_types:
  - chat
  type: livecodebench_aa_v2
target:
  api_endpoint: {}
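`livecodebench_aa_v2` shares the 2024-07-01 to 2025-01-01 data window with `livecodebench_0724_0125` but uses a larger generation budget and more retries. A small sketch comparing the two param sets (values copied from the YAML blocks above):

```python
# Illustrative diff of the two configs that share the same data window.
base = {"max_new_tokens": 4096, "max_retries": 5, "temperature": 0.0, "top_p": 1e-05}
aa_v2 = {"max_new_tokens": 16384, "max_retries": 30, "temperature": 0.0, "top_p": 1e-05}
diff = {k: (base[k], aa_v2[k]) for k in base if base[k] != aa_v2[k]}
print(diff)
# {'max_new_tokens': (4096, 16384), 'max_retries': (5, 30)}
```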

testoutputprediction#

Solve the natural language task on a specified input, evaluating the ability to generate testing outputs. The model is given the natural language problem description and an input, and the output should be the output for the problem.

Harness: livecodebench

Container:

nvcr.io/nvidia/eval-factory/livecodebench:26.01

Container Digest:

sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e

Container Arch: multiarch

Task Type: testoutputprediction

{% if target.api_endpoint.api_key_name is not none %}
  export API_KEY=${{target.api_endpoint.api_key_name}} && 
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
            --scenario {{config.params.task}} \
            --release_version {{config.params.extra.release_version}} \
            --url {{target.api_endpoint.url}} \
            --temperature {{config.params.temperature}} \
            --top_p {{config.params.top_p}} \
            --evaluate \
            --codegen_n {{config.params.extra.n_samples}} \
            --use_cache \
            --cache_batch_size {{config.params.extra.cache_batch_size}} \
            --num_process_evaluate {{config.params.extra.num_process_evaluate}} \
            --n {{config.params.extra.n_samples}} \
            --max_tokens {{config.params.max_new_tokens}} \
            --out_dir {{config.output_dir}} \
            --multiprocess {{config.params.parallelism}} \
            --max_retries {{config.params.max_retries}} \
            --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
framework_name: livecodebench
pkg_name: livecodebench
config:
  params:
    max_new_tokens: 4096
    max_retries: 5
    parallelism: 10
    task: testoutputprediction
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      num_process_evaluate: 5
      cache_batch_size: 10
      support_system_role: false
      start_date: null
      end_date: null
      cot_code_execution: false
      release_version: release_latest
  supported_endpoint_types:
  - chat
  type: testoutputprediction
target:
  api_endpoint: {}
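Each task draws `n_samples` completions per problem (10 here). Results of this kind are commonly summarized with the unbiased pass@k estimator; the sketch below is a standard formulation and may differ from the harness's own scoring code:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations (c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 is 3/10.
print(pass_at_k(10, 3, 1))
```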