scicode#

This page lists all evaluation tasks for the scicode harness.

| Task | Description |
|---|---|
| scicode | SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. Includes the default system prompt (“You are a helpful assistant.”). |
| scicode_aa_v2 | Same benchmark; this variant mimics the setup used by Artificial Analysis in their Intelligence Benchmark (v2): scientist-annotated background is included in the prompts and all available problems (including the “dev” set) are evaluated. Does not include a default system prompt. |
| scicode_background | Same benchmark; this variant includes scientist-annotated background in the prompts. Includes the default system prompt (“You are a helpful assistant.”). |

scicode#

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • Includes the default system prompt (“You are a helpful assistant.”).

Harness: scicode

Container: nvcr.io/nvidia/eval-factory/scicode:26.01

Container Digest: sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0

Container Arch: multiarch

Task Type: scicode

{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}}
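The Jinja template above assembles a single `scicode_eval` command line from the endpoint and config values. As a rough sketch of that expansion (the model ID, URL, and output directory below are made-up placeholder values; the flag names and defaults are taken from the template and config on this page), the core logic is:

```python
# Sketch of how the command template expands for the default scicode config.
# The parameter defaults mirror the config block on this page; the endpoint
# values passed to build_command() are illustrative placeholders.
params = {
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "request_timeout": 60,
    "max_new_tokens": 2048,
    "max_retries": 2,
    "parallelism": 1,
    "extra": {
        "n_samples": 1,
        "include_system_prompt": True,
        "with_background": False,
        "include_dev": False,
    },
}

def build_command(model_id: str, url: str, output_dir: str, p: dict) -> str:
    e = p["extra"]
    parts = [
        "scicode_eval",
        f"--model {model_id}",
        f"--url {url}",
        f"--output-dir {output_dir}/scicode_results",
        f"--log-dir {output_dir}/logs",
    ]
    if p["temperature"] is not None:      # mirrors {% if ... is not none %}
        parts.append(f"--temperature={p['temperature']}")
    parts.append(f"--n-samples={e['n_samples']}")
    parts.append(
        "--extra-params "
        f"top_p={p['top_p']},timeout={p['request_timeout']},"
        f"max_tokens={p['max_new_tokens']},max_retries={p['max_retries']},"
        f"include_system_prompt={e['include_system_prompt']}"
    )
    if e["with_background"]:              # set in the *_background variants
        parts.append("--with-background")
    if e["include_dev"]:                  # set in scicode_aa_v2
        parts.append("--include-dev")
    parts.append(f"--concurrent-requests={p['parallelism']}")
    return " ".join(parts)

cmd = build_command("my-model", "http://localhost:8000/v1", "/tmp/run", params)
print(cmd)
```

Optional flags such as `--eval-threads`, `--regex-path`, and `--prompt-template-type` follow the same pattern: they are emitted only when the corresponding `extra` value is not `null`.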
framework_name: scicode
pkg_name: scicode
config:
  params:
    max_new_tokens: 2048
    max_retries: 2
    parallelism: 1
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      with_background: false
      include_dev: false
      n_samples: 1
      eval_threads: null
      include_system_prompt: true
      regex_path: null
      prompt_template_type: null
  supported_endpoint_types:
  - chat
  type: scicode
target:
  api_endpoint:
    stream: false
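Assuming the harness accepts a run config of the same shape as the block above, individual defaults could be overridden selectively. The values below are purely illustrative, not recommended settings:

```yaml
config:
  params:
    max_new_tokens: 4096   # illustrative: raise the 2048 default
    parallelism: 4
    extra:
      n_samples: 5
```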

scicode_aa_v2#

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant mimics the setup used by Artificial Analysis in their Intelligence Benchmark (v2).
  • It includes scientist-annotated background in the prompts and uses all available problems for evaluation (including the “dev” set).
  • Does not include a default system prompt (“You are a helpful assistant.”).

Harness: scicode

Container: nvcr.io/nvidia/eval-factory/scicode:26.01

Container Digest: sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0

Container Arch: multiarch

Task Type: scicode_aa_v2

{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}}
framework_name: scicode
pkg_name: scicode
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 1
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      with_background: true
      include_dev: true
      n_samples: 3
      eval_threads: null
      include_system_prompt: false
      regex_path: aa_regex.txt
      prompt_template_type: background_comment_template.txt
  supported_endpoint_types:
  - chat
  type: scicode_aa_v2
target:
  api_endpoint:
    stream: false
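Compared with the base `scicode` task, `scicode_aa_v2` changes several defaults at once. A small Python sketch (values copied from the two config blocks on this page) makes the contrast explicit:

```python
# Parameter defaults that differ between scicode and scicode_aa_v2,
# copied from the two config blocks on this page.
scicode = {
    "max_new_tokens": 2048, "max_retries": 2, "n_samples": 1,
    "with_background": False, "include_dev": False,
    "include_system_prompt": True, "regex_path": None,
    "prompt_template_type": None,
}
scicode_aa_v2 = {
    "max_new_tokens": 16384, "max_retries": 30, "n_samples": 3,
    "with_background": True, "include_dev": True,
    "include_system_prompt": False, "regex_path": "aa_regex.txt",
    "prompt_template_type": "background_comment_template.txt",
}

# Collect every key whose default differs between the two variants.
diff = {k: (scicode[k], scicode_aa_v2[k])
        for k in scicode if scicode[k] != scicode_aa_v2[k]}
for key, (base, aa) in sorted(diff.items()):
    print(f"{key}: {base!r} -> {aa!r}")
```

Every listed parameter differs: the aa_v2 variant raises the generation and retry budgets, samples each problem three times, enables background and the dev set, drops the default system prompt, and supplies its own answer-extraction regex and prompt template.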

scicode_background#

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant includes scientist-annotated background in the prompts.
  • Includes the default system prompt (“You are a helpful assistant.”).

Harness: scicode

Container: nvcr.io/nvidia/eval-factory/scicode:26.01

Container Digest: sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0

Container Arch: multiarch

Task Type: scicode_background

{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}}
framework_name: scicode
pkg_name: scicode
config:
  params:
    max_new_tokens: 2048
    max_retries: 2
    parallelism: 1
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      with_background: true
      include_dev: false
      n_samples: 1
      eval_threads: null
      include_system_prompt: true
      regex_path: null
      prompt_template_type: null
  supported_endpoint_types:
  - chat
  type: scicode_background
target:
  api_endpoint:
    stream: false