codec#
This page contains all evaluation tasks for the codec harness.
| Task | Description |
|---|---|
| aime_2024 | Task for detecting contamination with the AIME 2024 dataset |
| aime_2025 | Task for detecting contamination with the AIME 2025 dataset |
| bbq | Task for detecting contamination with the BBQ dataset |
| bfcl_v3 | Task for detecting contamination with the BFCL v3 dataset |
| frames | Task for detecting contamination with the FRAMES dataset |
| gpqa_diamond | Task for detecting contamination with the GPQA Diamond dataset |
| gsm8k_test | Task for detecting contamination with the GSM8K test set |
| gsm8k_train | Task for detecting contamination with the GSM8K train set |
| hellaswag_test | Task for detecting contamination with the HellaSwag test set |
| hellaswag_train | Task for detecting contamination with the HellaSwag train set |
| hle | Task for detecting contamination with the HLE dataset |
| ifbench | Task for detecting contamination with the IFBench dataset |
| ifeval | Task for detecting contamination with the IFEval dataset |
| livecodebench_v1 | Task for detecting contamination with the LiveCodeBench v1 dataset |
| livecodebench_v5 | Task for detecting contamination with the LiveCodeBench v5 dataset |
| math_500_problem | Task for detecting contamination with the MATH-500 dataset (problem statements) |
| math_500_solution | Task for detecting contamination with the MATH-500 dataset (solutions) |
| mmlu_pro_test | Task for detecting contamination with the MMLU-Pro test set |
| mmlu_test | Task for detecting contamination with the MMLU test set |
| openai_humaneval | Task for detecting contamination with the OpenAI HumanEval dataset |
| reward_bench_v1 | Task for detecting contamination with the Reward Bench v1 dataset |
| reward_bench_v2 | Task for detecting contamination with the Reward Bench v2 dataset |
| scicode | Task for detecting contamination with the SciCode dataset |
|  | Task for detecting contamination with the SWE-bench dataset (test split) |
|  | Task for detecting contamination with the SWE-bench dataset (train split) |
|  | Task for detecting contamination with the Tau-bench dataset |
|  | Task for detecting contamination with the Terminal-Bench dataset |
aime_2024#
Task for detecting contamination with the AIME 2024 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: aime_2024
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: aime_2024
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: aime_2024
target:
  api_endpoint: {}
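The command template above is Jinja: the `export API_KEY=...` prefix is emitted only when `target.api_endpoint.api_key_name` is set, and `--n_samples` only when `config.params.limit_samples` is set. As a rough illustration (not the launcher's actual rendering code), the sketch below reproduces that logic with plain Python string handling. The function name `build_codec_command` and the sample model ID, URL, key name, and output directory are hypothetical.

```python
import shlex


def build_codec_command(target, config):
    """Assemble the codec command the way the template above does (illustrative only)."""
    endpoint = target["api_endpoint"]
    params = config["params"]
    extra = params["extra"]
    out_dir = config["output_dir"]

    prefix = ""
    if endpoint.get("api_key_name") is not None:
        # api_key_name holds the NAME of an environment variable; the shell expands $NAME.
        prefix = f"export API_KEY=${endpoint['api_key_name']} && "

    args = [
        "codec",
        "--model", endpoint["model_id"],
        "--eval_name", params["task"],
        "--contamination_type", extra["contamination_type"],
        "--url", endpoint["url"],
        "--temperature", str(params["temperature"]),
        "--top_p", str(params["top_p"]),
        "--n_context_seeds", str(extra["n_context_seeds"]),
        "--out_dir", f"{out_dir}/results",
        "--cache_dir", f"{out_dir}/cache",
        "--num_threads", str(params["parallelism"]),
        "--max_retries", str(params["max_retries"]),
        "--timeout", str(params["request_timeout"]),
        "--min_length", str(extra["min_length"]),
        "--max_length", str(extra["max_length"]),
    ]
    # The template appends --n_samples only when limit_samples is not none.
    if params.get("limit_samples") is not None:
        args += ["--n_samples", str(params["limit_samples"])]

    # shlex.quote is an addition of this sketch; the template itself does not quote values.
    return prefix + " ".join(shlex.quote(a) for a in args)


# Hypothetical endpoint values; the params are the aime_2024 defaults listed above.
target = {
    "api_endpoint": {
        "model_id": "my-org/my-model",
        "url": "http://localhost:8000/v1/completions",
        "api_key_name": "MY_API_KEY",
    }
}
config = {
    "output_dir": "/tmp/codec_run",
    "params": {
        "limit_samples": 1000,
        "max_retries": 10,
        "parallelism": 20,
        "task": "aime_2024",
        "temperature": 0.0,
        "request_timeout": 120,
        "top_p": 1.0,
        "extra": {
            "contamination_type": "in_context",
            "n_context_seeds": 5,
            "min_length": 100,
            "max_length": 2048,
        },
    },
}
command = build_codec_command(target, config)
print(command)
```

Setting `limit_samples` to `None` drops the `--n_samples` flag, and leaving `api_key_name` unset drops the `export` prefix, which mirrors the two `{% if ... %}` branches of the template.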
aime_2025#
Task for detecting contamination with the AIME 2025 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: aime_2025
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: aime_2025
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: aime_2025
target:
  api_endpoint: {}
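Because the defaults are nested, changing one value (say, a smaller `limit_samples` for a smoke test) should leave its siblings untouched. The sketch below shows one way to layer overrides on top of the defaults with a recursive merge. `deep_merge` is a hypothetical helper written for this illustration, not a function of the harness, and the way the launcher actually applies overrides may differ.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict in which overrides win and nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The aime_2025 defaults from this section.
defaults = {
    "framework_name": "codec",
    "pkg_name": "codec",
    "config": {
        "params": {
            "limit_samples": 1000,
            "max_retries": 10,
            "parallelism": 20,
            "task": "aime_2025",
            "temperature": 0.0,
            "request_timeout": 120,
            "top_p": 1.0,
            "extra": {
                "contamination_type": "in_context",
                "n_context_seeds": 5,
                "min_length": 100,
                "max_length": 2048,
            },
        },
        "supported_endpoint_types": ["completions"],
        "type": "aime_2025",
    },
    "target": {"api_endpoint": {}},
}

# Example overrides (values chosen arbitrarily for illustration).
overrides = {"config": {"params": {"limit_samples": 50, "extra": {"n_context_seeds": 3}}}}

merged = deep_merge(defaults, overrides)
print(merged["config"]["params"]["limit_samples"])        # 50
print(merged["config"]["params"]["extra"]["min_length"])  # 100
```

The merge copies rather than mutates, so the original defaults stay available for the next run.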
bbq#
Task for detecting contamination with the BBQ dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: bbq
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bbq
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bbq
target:
  api_endpoint: {}
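Each task lists both a tag (`26.01`) and a digest for the container. A tag can later point at a different image, whereas a digest identifies exact content, so pinning by digest is the more reproducible choice. A small sketch, assuming the common `name@sha256:<64 hex characters>` reference form accepted by container runtimes; `pinned_reference` is a hypothetical helper, not part of the harness.

```python
import re

# A sha256 digest is the literal prefix followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")

IMAGE = "nvcr.io/nvidia/eval-factory/contamination-detection:26.01"
DIGEST = "sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2"


def pinned_reference(image, digest):
    """Replace the tag of `image` with `@digest` after validating the digest format."""
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    # Only the last path component can carry a tag; a ':' earlier on is a registry port.
    repo, _, last = image.rpartition("/")
    name = last.split(":", 1)[0]
    base = f"{repo}/{name}" if repo else name
    return f"{base}@{digest}"


reference = pinned_reference(IMAGE, DIGEST)
print(reference)
```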
bfcl_v3#
Task for detecting contamination with the BFCL v3 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: bfcl_v3
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bfcl_v3
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bfcl_v3
target:
  api_endpoint: {}
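`supported_endpoint_types` lists only `completions`, so an endpoint of any other type would not match these defaults. A pre-flight check along the following lines can surface that mismatch before any requests are sent. `check_endpoint_type` is a hypothetical helper, and the `chat` value is used purely as an example of a type that is not in the list.

```python
def check_endpoint_type(config, endpoint_type):
    """Raise ValueError if endpoint_type is not listed for the task; otherwise return it."""
    supported = config.get("supported_endpoint_types", [])
    if endpoint_type not in supported:
        raise ValueError(
            f"task {config.get('type')!r} supports {supported}, got {endpoint_type!r}"
        )
    return endpoint_type


# The relevant part of the bfcl_v3 defaults from this section.
bfcl_v3_config = {"supported_endpoint_types": ["completions"], "type": "bfcl_v3"}

print(check_endpoint_type(bfcl_v3_config, "completions"))  # completions

try:
    check_endpoint_type(bfcl_v3_config, "chat")
except ValueError as err:
    print(f"rejected: {err}")
```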
frames#
Task for detecting contamination with the FRAMES dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: frames
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: frames
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: frames
target:
  api_endpoint: {}
gpqa_diamond#
Task for detecting contamination with the GPQA Diamond dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: gpqa_diamond
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gpqa_diamond
target:
  api_endpoint: {}
gsm8k_test#
Task for detecting contamination with the GSM8K test set
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: gsm8k_test
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_test
target:
  api_endpoint: {}
gsm8k_train#
Task for detecting contamination with the GSM8K train set
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: gsm8k_train
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_train
target:
  api_endpoint: {}
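The task sections up to this point list identical defaults and differ only in the `task` and `type` values. When scripting several contamination checks, for example the GSM8K test and train splits side by side, each task's configuration can therefore be derived from one shared template. A sketch under that assumption; `SHARED_PARAMS` and `task_defaults` are hypothetical names, and any other task should be checked against its own listing before relying on this.

```python
import copy

# Parameter values shared by the task sections listed above.
SHARED_PARAMS = {
    "limit_samples": 1000,
    "max_retries": 10,
    "parallelism": 20,
    "temperature": 0.0,
    "request_timeout": 120,
    "top_p": 1.0,
    "extra": {
        "contamination_type": "in_context",
        "n_context_seeds": 5,
        "min_length": 100,
        "max_length": 2048,
    },
}


def task_defaults(task):
    """Build the defaults for one task; only `task` and `type` vary."""
    params = copy.deepcopy(SHARED_PARAMS)
    params["task"] = task
    return {
        "framework_name": "codec",
        "pkg_name": "codec",
        "config": {
            "params": params,
            "supported_endpoint_types": ["completions"],
            "type": task,
        },
        "target": {"api_endpoint": {}},
    }


configs = {task: task_defaults(task) for task in ("gsm8k_test", "gsm8k_train")}
for task, cfg in configs.items():
    print(task, cfg["config"]["params"]["extra"]["contamination_type"])
```

Each configuration gets its own deep copy of the shared parameters, so adjusting one task later does not leak into the others.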
hellaswag_test#
Task for detecting contamination with the HellaSwag test set
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: hellaswag_test
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_test
target:
  api_endpoint: {}
hellaswag_train#
Task for detecting contamination with the HellaSwag train set
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: hellaswag_train
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_train
target:
  api_endpoint: {}
hle#
Task for detecting contamination with the HLE dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: hle
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hle
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hle
target:
  api_endpoint: {}
ifbench#
Task for detecting contamination with the IFBench dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: ifbench
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifbench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifbench
target:
  api_endpoint: {}
ifeval#
Task for detecting contamination with the IFEval dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: ifeval
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifeval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifeval
target:
  api_endpoint: {}
livecodebench_v1#
Task for detecting contamination with the LiveCodeBench v1 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: livecodebench_v1
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v1
target:
  api_endpoint: {}
livecodebench_v5#
Task for detecting contamination with the LiveCodeBench v5 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: livecodebench_v5
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v5
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v5
target:
  api_endpoint: {}
math_500_problem#
Task for detecting contamination with the MATH-500 dataset (problem statements)
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: math_500_problem
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_problem
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_problem
target:
  api_endpoint: {}
math_500_solution#
Task for detecting contamination with the MATH-500 dataset (solutions)
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: math_500_solution
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_solution
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_solution
target:
  api_endpoint: {}
mmlu_pro_test#
Task for detecting contamination with the MMLU-Pro test set
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: mmlu_pro_test
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_pro_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_pro_test
target:
  api_endpoint: {}
mmlu_test#
Task for detecting contamination with the MMLU test set
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: mmlu_test
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_test
target:
  api_endpoint: {}
openai_humaneval#
Task for detecting contamination with the OpenAI HumanEval dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: openai_humaneval
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: openai_humaneval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: openai_humaneval
target:
  api_endpoint: {}
reward_bench_v1#
Task for detecting contamination with the Reward Bench v1 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: reward_bench_v1
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: reward_bench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: reward_bench_v1
target:
  api_endpoint: {}
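To make the mapping from these defaults to the codec command line concrete, the following Python sketch assembles the same flags that the command template above lists. It is only an illustration: the harness renders the Jinja template itself, and the helper name `build_codec_command`, the model id, the URL, the API key variable name, and the output directory are all hypothetical values chosen for the example.

```python
import shlex


def build_codec_command(config, target):
    """Mirror the command template: map config/target fields to codec flags.

    Hypothetical helper for illustration; not part of the codec package.
    """
    params = config["params"]
    extra = params["extra"]
    endpoint = target["api_endpoint"]
    out_dir = config["output_dir"]

    args = [
        "codec",
        "--model", endpoint["model_id"],
        "--eval_name", params["task"],
        "--contamination_type", extra["contamination_type"],
        "--url", endpoint["url"],
        "--temperature", str(params["temperature"]),
        "--top_p", str(params["top_p"]),
        "--n_context_seeds", str(extra["n_context_seeds"]),
        "--out_dir", f"{out_dir}/results",
        "--cache_dir", f"{out_dir}/cache",
        "--num_threads", str(params["parallelism"]),
        "--max_retries", str(params["max_retries"]),
        "--timeout", str(params["request_timeout"]),
        "--min_length", str(extra["min_length"]),
        "--max_length", str(extra["max_length"]),
    ]
    # --n_samples is only emitted when limit_samples is set, as in the template.
    if params.get("limit_samples") is not None:
        args += ["--n_samples", str(params["limit_samples"])]

    command = " ".join(shlex.quote(arg) for arg in args)

    # The template exports API_KEY from the named environment variable, if any.
    key_name = endpoint.get("api_key_name")
    if key_name is not None:
        command = f"export API_KEY=${key_name} && {command}"
    return command


# Example values: params copied from the defaults above, the rest hypothetical.
example_config = {
    "output_dir": "/tmp/codec_run",
    "params": {
        "limit_samples": 1000,
        "max_retries": 10,
        "parallelism": 20,
        "task": "reward_bench_v1",
        "temperature": 0.0,
        "request_timeout": 120,
        "top_p": 1.0,
        "extra": {
            "contamination_type": "in_context",
            "n_context_seeds": 5,
            "min_length": 100,
            "max_length": 2048,
        },
    },
}
example_target = {
    "api_endpoint": {
        "model_id": "my-model",
        "url": "http://localhost:8000/v1/completions",
        "api_key_name": "MY_API_KEY",
    }
}

example_command = build_codec_command(example_config, example_target)
print(example_command)
```

Note that `parallelism` becomes `--num_threads` and `request_timeout` becomes `--timeout`; the flag names do not always match the parameter names.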
reward_bench_v2#
Task for detecting contamination with the Reward Bench v2 dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: reward_bench_v2
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: reward_bench_v2
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: reward_bench_v2
target:
  api_endpoint: {}
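The tasks in this section share the same parameter values, so a run will usually change only a few of them. The sketch below shows one way to layer overrides onto those values with a recursive merge. It assumes that user settings are combined with the defaults key by key; `deep_merge` is a hypothetical helper written for this illustration, not a codec function.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict where values in `overrides` replace those in `defaults`.

    Nested dicts are merged recursively; all other values are replaced.
    Hypothetical helper for illustration only.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Parameter values copied from the block above.
default_params = {
    "params": {
        "limit_samples": 1000,
        "max_retries": 10,
        "parallelism": 20,
        "task": "reward_bench_v2",
        "temperature": 0.0,
        "request_timeout": 120,
        "top_p": 1.0,
        "extra": {
            "contamination_type": "in_context",
            "n_context_seeds": 5,
            "min_length": 100,
            "max_length": 2048,
        },
    }
}

# Example overrides: remove the sample limit and use fewer context seeds.
overrides = {"params": {"limit_samples": None, "extra": {"n_context_seeds": 3}}}

merged = deep_merge(default_params, overrides)
print(merged["params"]["limit_samples"], merged["params"]["extra"])
```

Because the command template only adds `--n_samples` when `limit_samples` is not none, setting it to `None` as in this example would drop that flag from the rendered command.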
scicode#
Task for detecting contamination with the SciCode dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: scicode
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: scicode
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: scicode
target:
  api_endpoint: {}
swebench_test#
Task for detecting contamination with the SWE-bench dataset (test split)
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: swebench_test
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: swebench_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: swebench_test
target:
  api_endpoint: {}
swebench_train#
Task for detecting contamination with the SWE-bench dataset (train split)
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: swebench_train
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: swebench_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: swebench_train
target:
  api_endpoint: {}
taubench#
Task for detecting contamination with the Tau-bench dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: taubench
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: taubench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: taubench
target:
  api_endpoint: {}
terminalbench#
Task for detecting contamination with the Terminal-Bench dataset
Harness: codec
Container:
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
Container Digest:
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
Container Arch: amd
Task Type: terminalbench
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: terminalbench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: terminalbench
target:
  api_endpoint: {}
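The task entries above differ only in their `task` and `type` values, so their configuration can be generated from a single template. The Python sketch below does that for the eight tasks listed in this part of the page. The function name `default_config` is hypothetical, and placing `supported_endpoint_types` and `type` under `config` is an assumption about the nesting rather than something the page states.

```python
# Task names taken from the entries in this part of the page.
TASKS = [
    "openai_humaneval",
    "reward_bench_v1",
    "reward_bench_v2",
    "scicode",
    "swebench_test",
    "swebench_train",
    "taubench",
    "terminalbench",
]


def default_config(task):
    """Build the configuration shown above for one task (hypothetical helper)."""
    return {
        "framework_name": "codec",
        "pkg_name": "codec",
        "config": {
            "params": {
                "limit_samples": 1000,
                "max_retries": 10,
                "parallelism": 20,
                "task": task,
                "temperature": 0.0,
                "request_timeout": 120,
                "top_p": 1.0,
                "extra": {
                    "contamination_type": "in_context",
                    "n_context_seeds": 5,
                    "min_length": 100,
                    "max_length": 2048,
                },
            },
            # Assumed nesting: these two keys sit beside `params` under `config`.
            "supported_endpoint_types": ["completions"],
            "type": task,
        },
        "target": {"api_endpoint": {}},
    }


# One independent config dict per task, keyed by task name.
configs = {task: default_config(task) for task in TASKS}

for name, cfg in configs.items():
    print(name, cfg["config"]["params"]["task"], cfg["config"]["type"])
```

Building each dict from a function call rather than copying one shared dict keeps the per-task configs independent, so changing a parameter for one task does not affect the others.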