codec#

This page contains all evaluation tasks for the codec harness.

| Task | Description |
| --- | --- |
| aime_2024 | Task for detecting contamination with the AIME 2024 dataset |
| aime_2025 | Task for detecting contamination with the AIME 2025 dataset |
| bbq | Task for detecting contamination with the BBQ dataset |
| bfcl_v3 | Task for detecting contamination with the BFCL v3 dataset |
| frames | Task for detecting contamination with the FRAMES dataset |
| gpqa_diamond | Task for detecting contamination with the GPQA Diamond dataset |
| gsm8k_test | Task for detecting contamination with the GSM8K test set |
| gsm8k_train | Task for detecting contamination with the GSM8K train set |
| hellaswag_test | Task for detecting contamination with the HellaSwag test set |
| hellaswag_train | Task for detecting contamination with the HellaSwag train set |
| hle | Task for detecting contamination with the HLE dataset |
| ifbench | Task for detecting contamination with the IFBench dataset |
| ifeval | Task for detecting contamination with the IFEval dataset |
| livecodebench_v1 | Task for detecting contamination with the LiveCodeBench v1 dataset |
| livecodebench_v5 | Task for detecting contamination with the LiveCodeBench v5 dataset |
| math_500_problem | Task for detecting contamination with the MATH-500 dataset (problem statements) |
| math_500_solution | Task for detecting contamination with the MATH-500 dataset (solutions) |
| mmlu_pro_test | Task for detecting contamination with the MMLU-Pro test set |
| mmlu_test | Task for detecting contamination with the MMLU test set |
| openai_humaneval | Task for detecting contamination with the OpenAI HumanEval dataset |
| reward_bench_v1 | Task for detecting contamination with the RewardBench v1 dataset |
| reward_bench_v2 | Task for detecting contamination with the RewardBench v2 dataset |
| scicode | Task for detecting contamination with the SciCode dataset |
| swebench_test | Task for detecting contamination with the SWE-bench dataset (test split) |
| swebench_train | Task for detecting contamination with the SWE-bench dataset (train split) |
| taubench | Task for detecting contamination with the Tau-bench dataset |
| terminalbench | Task for detecting contamination with the Terminal-Bench dataset |
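
The per-task sections that follow repeat the same defaults and differ only in the `task` and `type` fields. As a rough illustration of that pattern, the sketch below generates one config dict per task name from the table. The helper `task_config` and the constant names are hypothetical and not part of the codec harness, and the sketch assumes the shared defaults hold for every task listed.

```python
import copy

# Task names copied from the table above.
TASKS = [
    "aime_2024", "aime_2025", "bbq", "bfcl_v3", "frames", "gpqa_diamond",
    "gsm8k_test", "gsm8k_train", "hellaswag_test", "hellaswag_train", "hle",
    "ifbench", "ifeval", "livecodebench_v1", "livecodebench_v5",
    "math_500_problem", "math_500_solution", "mmlu_pro_test", "mmlu_test",
    "openai_humaneval", "reward_bench_v1", "reward_bench_v2", "scicode",
    "swebench_test", "swebench_train", "taubench", "terminalbench",
]

# Shared default parameters as listed in the per-task sections.
BASE_PARAMS = {
    "limit_samples": 1000,
    "max_retries": 10,
    "parallelism": 20,
    "temperature": 0.0,
    "request_timeout": 120,
    "top_p": 1.0,
    "extra": {
        "contamination_type": "in_context",
        "n_context_seeds": 5,
        "min_length": 100,
        "max_length": 2048,
    },
}


def task_config(task):
    """Return a defaults-style config dict for one task (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown codec task: {task}")
    params = copy.deepcopy(BASE_PARAMS)  # avoid sharing the nested "extra" dict
    params["task"] = task
    return {
        "params": params,
        "supported_endpoint_types": ["completions"],
        "type": task,
    }


configs = {task: task_config(task) for task in TASKS}
print(len(configs), "task configs generated")
```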

aime_2024#

Task for detecting contamination with the AIME 2024 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: aime_2024

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: aime_2024
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: aime_2024
target:
  api_endpoint: {}
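
To make the mapping between the defaults and the command template easier to follow, here is a minimal Python sketch that assembles the same argument list by hand. It is not the harness's own rendering code. The function `build_codec_args` is hypothetical, and the model ID, URL, and output directory are placeholder values chosen for illustration.

```python
import shlex


def build_codec_args(config, target):
    """Assemble the codec arguments in the order used by the command template."""
    params = config["params"]
    extra = params["extra"]
    endpoint = target["api_endpoint"]
    out = config["output_dir"]
    args = [
        "codec",
        "--model", endpoint["model_id"],
        "--eval_name", params["task"],
        "--contamination_type", extra["contamination_type"],
        "--url", endpoint["url"],
        "--temperature", str(params["temperature"]),
        "--top_p", str(params["top_p"]),
        "--n_context_seeds", str(extra["n_context_seeds"]),
        "--out_dir", f"{out}/results",
        "--cache_dir", f"{out}/cache",
        "--num_threads", str(params["parallelism"]),
        "--max_retries", str(params["max_retries"]),
        "--timeout", str(params["request_timeout"]),
        "--min_length", str(extra["min_length"]),
        "--max_length", str(extra["max_length"]),
    ]
    # The template only emits --n_samples when limit_samples is set.
    if params.get("limit_samples") is not None:
        args += ["--n_samples", str(params["limit_samples"])]
    return args


config = {
    "output_dir": "/tmp/codec_run",  # placeholder output directory
    "params": {
        "limit_samples": 1000,
        "max_retries": 10,
        "parallelism": 20,
        "task": "aime_2024",
        "temperature": 0.0,
        "request_timeout": 120,
        "top_p": 1.0,
        "extra": {
            "contamination_type": "in_context",
            "n_context_seeds": 5,
            "min_length": 100,
            "max_length": 2048,
        },
    },
}
# Placeholder endpoint values; substitute your own model ID and URL.
target = {"api_endpoint": {"model_id": "my-model",
                           "url": "http://localhost:8000/v1/completions"}}

args = build_codec_args(config, target)
print(shlex.join(args))
```

Note that `parallelism` maps to `--num_threads`, `request_timeout` to `--timeout`, and `limit_samples` to `--n_samples`, so the config keys and the CLI flags do not always share a name.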

aime_2025#

Task for detecting contamination with the AIME 2025 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: aime_2025

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: aime_2025
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: aime_2025
target:
  api_endpoint: {}
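
The first clause of the command template is easy to misread: `${{target.api_endpoint.api_key_name}}` is a literal `$` followed by a template substitution, so `api_key_name` appears to hold the name of an environment variable rather than the key itself. The sketch below mimics that clause in plain Python. The function `api_key_prefix` and the variable name `MY_ENDPOINT_KEY` are hypothetical, and the identifier check is an added safeguard that the template does not perform.

```python
import re


def api_key_prefix(api_key_name):
    """Return the shell prefix emitted before the codec command."""
    if api_key_name is None:
        # Mirrors the template: no export when api_key_name is none.
        return ""
    # Added safeguard (not in the template): require a plain variable name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", api_key_name):
        raise ValueError(f"not a valid environment variable name: {api_key_name!r}")
    # The shell expands $NAME at run time, so the key never enters the config.
    return f"export API_KEY=${api_key_name} && "


print(repr(api_key_prefix("MY_ENDPOINT_KEY")))
print(repr(api_key_prefix(None)))
```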

bbq#

Task for detecting contamination with the BBQ dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: bbq
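
The container is listed both by tag and by digest. Container runtimes such as Docker accept a `name@sha256:...` reference, which pins the exact image regardless of where the tag later points. The following sketch combines the two values shown above into that form; the helper `pinned_reference` is hypothetical and is not something the harness provides.

```python
import re

IMAGE = "nvcr.io/nvidia/eval-factory/contamination-detection:26.01"
DIGEST = "sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2"


def pinned_reference(image, digest):
    """Combine a tagged image name and a digest into name@digest form."""
    if not re.fullmatch(r"sha256:[0-9a-f]{64}", digest):
        raise ValueError(f"unexpected digest format: {digest!r}")
    # Strip the tag from the last path component only, so that a registry
    # port such as host:5000/... is not mistaken for a tag.
    head, sep, last = image.rpartition("/")
    name = last.split(":", 1)[0]
    return f"{head}{sep}{name}@{digest}"


print(pinned_reference(IMAGE, DIGEST))
```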

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bbq
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bbq
target:
  api_endpoint: {}

bfcl_v3#

Task for detecting contamination with the BFCL v3 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: bfcl_v3

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bfcl_v3
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bfcl_v3
target:
  api_endpoint: {}
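
The defaults list `completions` as the only supported endpoint type. A caller could guard against pointing the task at another kind of endpoint with a check along these lines. This is a sketch only: `check_endpoint_type` is a hypothetical helper, and the `chat` value in the usage lines is an assumed example of an unsupported type.

```python
# Value taken from supported_endpoint_types in the defaults above.
SUPPORTED_ENDPOINT_TYPES = ["completions"]


def check_endpoint_type(endpoint_type):
    """Raise ValueError if the endpoint type is not supported by this task."""
    if endpoint_type not in SUPPORTED_ENDPOINT_TYPES:
        supported = ", ".join(SUPPORTED_ENDPOINT_TYPES)
        raise ValueError(
            f"endpoint type {endpoint_type!r} is not supported; expected one of: {supported}"
        )


check_endpoint_type("completions")  # passes silently

try:
    check_endpoint_type("chat")  # assumed example of an unsupported type
except ValueError as err:
    print(err)
```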

frames#

Task for detecting contamination with the FRAMES dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: frames

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: frames
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: frames
target:
  api_endpoint: {}
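
Because the defaults are nested (`config.params.extra.*`), changing a single value calls for a recursive merge rather than a top-level dictionary update. This page does not describe how the harness applies overrides, so the sketch below is one generic way to do it; `merge_overrides` is a hypothetical helper, and only an excerpt of the defaults is reproduced.

```python
import copy


def merge_overrides(defaults, overrides):
    """Return a new dict with overrides applied recursively on top of defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections such as params.extra.
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Excerpt of the defaults listed above.
defaults = {
    "params": {
        "limit_samples": 1000,
        "task": "frames",
        "extra": {"n_context_seeds": 5, "min_length": 100, "max_length": 2048},
    },
}
# Example override: drop the sample limit and use fewer context seeds.
overrides = {"params": {"limit_samples": None, "extra": {"n_context_seeds": 3}}}

merged = merge_overrides(defaults, overrides)
print(merged["params"])
```

With `limit_samples` set to `None`, the conditional at the end of the command template would omit `--n_samples` entirely.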

gpqa_diamond#

Task for detecting contamination with the GPQA Diamond dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: gpqa_diamond

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gpqa_diamond
target:
  api_endpoint: {}

gsm8k_test#

Task for detecting contamination with the GSM8K test set

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: gsm8k_test

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_test
target:
  api_endpoint: {}

gsm8k_train#

Task for detecting contamination with the GSM8K train set

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: gsm8k_train

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_train
target:
  api_endpoint: {}

hellaswag_test#

Task for detecting contamination with the HellaSwag test set

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: hellaswag_test

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_test
target:
  api_endpoint: {}

hellaswag_train#

Task for detecting contamination with the HellaSwag train set

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: hellaswag_train

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_train
target:
  api_endpoint: {}

hle#

Task for detecting contamination with the HLE dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: hle

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hle
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hle
target:
  api_endpoint: {}

ifbench#

Task for detecting contamination with the IFBench dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: ifbench

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifbench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifbench
target:
  api_endpoint: {}

ifeval#

Task for detecting contamination with the IFEval dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: ifeval

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifeval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifeval
target:
  api_endpoint: {}

livecodebench_v1#

Task for detecting contamination with the LiveCodeBench v1 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: livecodebench_v1

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v1
target:
  api_endpoint: {}

livecodebench_v5#

Task for detecting contamination with the LiveCodeBench v5 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: livecodebench_v5

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v5
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v5
target:
  api_endpoint: {}

math_500_problem#

Task for detecting contamination with the MATH-500 dataset (problem statements)

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: math_500_problem

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_problem
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_problem
target:
  api_endpoint: {}

math_500_solution#

Task for detecting contamination with the MATH-500 dataset (solutions)

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: math_500_solution

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_solution
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_solution
target:
  api_endpoint: {}

mmlu_pro_test#

Task for detecting contamination with the MMLU-Pro test set

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: mmlu_pro_test

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_pro_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_pro_test
target:
  api_endpoint: {}

mmlu_test#

Task for detecting contamination with the MMLU test set

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: mmlu_test

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_test
target:
  api_endpoint: {}

openai_humaneval#

Task for detecting contamination with the OpenAI HumanEval dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: openai_humaneval

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: openai_humaneval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: openai_humaneval
target:
  api_endpoint: {}

reward_bench_v1#

Task for detecting contamination with the Reward Bench v1 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: reward_bench_v1

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: reward_bench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: reward_bench_v1
target:
  api_endpoint: {}

reward_bench_v2#

Task for detecting contamination with the Reward Bench v2 dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: reward_bench_v2

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: reward_bench_v2
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: reward_bench_v2
target:
  api_endpoint: {}

scicode#

Task for detecting contamination with the SciCode dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: scicode

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: scicode
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: scicode
target:
  api_endpoint: {}

swebench_test#

Task for detecting contamination with the SWE-bench dataset (test split)

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: swebench_test

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: swebench_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: swebench_test
target:
  api_endpoint: {}

swebench_train#

Task for detecting contamination with the SWE-bench dataset (train split)

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: swebench_train

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: swebench_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: swebench_train
target:
  api_endpoint: {}

taubench#

Task for detecting contamination with the Tau-bench dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: taubench

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: taubench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: taubench
target:
  api_endpoint: {}

terminalbench#

Task for detecting contamination with the Terminal-Bench dataset

Harness: codec

Container:

nvcr.io/nvidia/eval-factory/contamination-detection:26.01

Container Digest:

sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2

Container Arch: amd

Task Type: terminalbench

{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: terminalbench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: terminalbench
target:
  api_endpoint: {}