helm#
This page contains all evaluation tasks for the helm harness.
| Task | Description |
|---|---|
| aci_bench | Extract and structure information from patient-doctor conversations |
| ehr_sql | Given a natural language instruction, generate an SQL query that would be used in clinical research |
| head_qa | A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019) |
| med_dialog_healthcaremagic | Generate summaries of doctor-patient conversations (HealthCareMagic version) |
| med_dialog_icliniq | Generate summaries of doctor-patient conversations (iCliniq version) |
| medbullets | A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025) |
| medcalc_bench | A dataset consisting of a patient note, a question asking for a specific computed medical value, and a ground-truth answer (Khandekar et al., 2024) |
| medec | A dataset of medical narratives with error detection and correction pairs (Abacha et al., 2025) |
| medhallu | A dataset of PubMed articles and associated questions; the objective is to classify whether the answer is factual or hallucinated |
| medi_qa | Retrieve and rank answers based on medical question understanding |
| medication_qa | Answer consumer medication-related questions |
| mtsamples_procedures | Document and extract information about medical procedures |
| mtsamples_replicate | Generate treatment plans based on clinical notes |
| pubmed_qa | A dataset of PubMed abstracts and associated questions (yes/no/maybe format) |
| race_based | A collection of LLM outputs in response to medical questions with race-based biases; the objective is to classify whether the output contains racially biased content |
aci_bench#
Extract and structure information from patient-doctor conversations
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: aci_bench
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
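Every optional parameter in the template above is guarded by an `{% if ... is not none %}` check, so a `null` entry in the config simply drops its flag from the final `helm-run` invocation. A minimal Python sketch of that conditional flag-building logic (the helper name and dict layout are illustrative, not part of the harness):

```python
def build_helm_flags(params: dict) -> str:
    """Append a helm-run flag only when its parameter is non-null,
    mirroring the `{% if ... is not none %}` guards in the template."""
    flag_map = {
        "limit_samples": "--max-eval-instances",
        "parallelism": "-n",
        "num_train_trials": "--num-train-trials",
        "data_path": "--data-path",
        "num_output_tokens": "--num-output-tokens",
        "subject": "--subject",
        "condition": "--condition",
        "max_length": "--max-length",
    }
    parts = []
    for key, flag in flag_map.items():
        value = params.get(key)
        if value is not None:  # null parameters are omitted entirely
            parts.append(f"{flag} {value}")
    return " ".join(parts)

# A config with only limit_samples and parallelism set:
print(build_helm_flags({"limit_samples": 10, "parallelism": 1, "data_path": None}))
# → --max-eval-instances 10 -n 1
```

This is why the default configs below can list every `extra` key as `null`: unset keys cost nothing and the rendered command stays minimal.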
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: aci_bench
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: aci_bench
target:
  api_endpoint: {}
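Note that `gpt_judge_api_key` and the other judge-key entries hold environment-variable *names*, not secrets: for each non-null entry, the command template emits an `export NAME=$<value>` pass-through so the judge clients inside the container can read the key from the caller's environment. A sketch of that rendering step (the function name is illustrative):

```python
def judge_exports(extra: dict) -> str:
    """Render the judge-key export prefix from the config's `extra` block.
    Each entry names an environment variable; the generated shell simply
    forwards that variable into the container environment."""
    exports = []
    for key in ("gpt_judge_api_key", "llama_judge_api_key", "claude_judge_api_key"):
        var_name = extra.get(key)
        if var_name is not None:
            exports.append(f"export {key.upper()}=${var_name}")
    return " && ".join(exports)

print(judge_exports({"gpt_judge_api_key": "GPT_JUDGE_API_KEY"}))
# → export GPT_JUDGE_API_KEY=$GPT_JUDGE_API_KEY
```

With the defaults above, all three keys are forwarded verbatim, so setting `GPT_JUDGE_API_KEY` (and friends) in the calling shell is all that is required.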
ehr_sql#
Given a natural language instruction, generate an SQL query that would be used in clinical research.
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: ehr_sql
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: ehr_sql
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: ehr_sql
target:
  api_endpoint: {}
head_qa#
A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019).
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: head_qa
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: head_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: head_qa
target:
  api_endpoint: {}
med_dialog_healthcaremagic#
Generate summaries of doctor-patient conversations, healthcaremagic version
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: med_dialog_healthcaremagic
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: med_dialog
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: healthcaremagic
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: med_dialog_healthcaremagic
target:
  api_endpoint: {}
med_dialog_icliniq#
Generate summaries of doctor-patient conversations, icliniq version
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: med_dialog_icliniq
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: med_dialog
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: icliniq
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: med_dialog_icliniq
target:
  api_endpoint: {}
medbullets#
A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025).
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: medbullets
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medbullets
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: medbullets
target:
  api_endpoint: {}
medcalc_bench#
A dataset that consists of a patient note, a question asking for a specific computed medical value, and a ground-truth answer (Khandekar et al., 2024).
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: medcalc_bench
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medcalc_bench
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: medcalc_bench
target:
  api_endpoint: {}
medec#
A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025).
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: medec
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medec
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: medec
target:
  api_endpoint: {}
medhallu#
A dataset of PubMed articles and associated questions, with the objective being to classify whether the answer is factual or hallucinated.
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: medhallu
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medhallu
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: medhallu
target:
  api_endpoint: {}
medi_qa#
Retrieve and rank answers based on medical question understanding
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: medi_qa
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medi_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: medi_qa
target:
  api_endpoint: {}
medication_qa#
Answer consumer medication-related questions
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: medication_qa
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medication_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: medication_qa
target:
  api_endpoint: {}
mtsamples_procedures#
Document and extract information about medical procedures
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: mtsamples_procedures
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: mtsamples_procedures
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: mtsamples_procedures
target:
  api_endpoint: {}
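The command template above is driven by the `null` defaults in this config: every `extra` parameter that stays `null` is skipped by a `{% if ... is not none %}` guard, and every non-null value becomes a `helm-run` flag. The sketch below is a hypothetical Python rendering of that conditional logic, written to illustrate the template's behavior; the `build_helm_flags` helper is not part of the helm harness itself.

```python
# Illustrative re-implementation of the template's flag conditionals:
# each non-null "extra" param maps to a helm-run CLI flag, and null
# params are omitted entirely.

def build_helm_flags(params: dict) -> list[str]:
    """Map non-null config params to helm-run CLI flags."""
    flag_names = {
        "num_train_trials": "--num-train-trials",
        "data_path": "--data-path",
        "num_output_tokens": "--num-output-tokens",
        "subject": "--subject",
        "condition": "--condition",
        "max_length": "--max-length",
    }
    flags = []
    for key, flag in flag_names.items():
        value = params.get("extra", {}).get(key)
        if value is not None:  # mirrors `{% if ... is not none %}`
            flags.extend([flag, str(value)])
    return flags

# With every extra param null (the defaults above), no flags are emitted:
defaults = {"extra": {k: None for k in (
    "data_path", "num_output_tokens", "subject",
    "condition", "max_length", "num_train_trials")}}
print(build_helm_flags(defaults))  # []

# Setting num_train_trials produces the matching flag:
print(build_helm_flags({"extra": {"num_train_trials": 3}}))
# ['--num-train-trials', '3']
```

Because `{{config.params.limit_samples}}` and `{{config.params.parallelism}}` follow the same pattern for `--max-eval-instances` and `-n`, overriding any of these values in the config is all that is needed to change the rendered command.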
mtsamples_replicate#
Generate treatment plans based on clinical notes
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: mtsamples_replicate
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: mtsamples_replicate
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: mtsamples_replicate
target:
  api_endpoint: {}
pubmed_qa#
A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format).
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: pubmed_qa
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: pubmed_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: pubmed_qa
target:
  api_endpoint: {}
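Since pubmed_qa asks for one of three closed-form answers (yes/no/maybe), a natural way to score it is exact match over normalized labels. The snippet below is a minimal illustration of that idea only; the metric HELM actually reports for this task may be computed differently.

```python
# Minimal sketch of exact-match scoring for a yes/no/maybe task such
# as pubmed_qa: a prediction counts as correct when it matches the
# reference label after case and whitespace normalization.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the reference label exactly."""
    normalize = lambda s: s.strip().lower()
    correct = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = ["yes", "No", "maybe", "yes"]
refs = ["yes", "no", "no", "maybe"]
print(exact_match_accuracy(preds, refs))  # 0.5
```

Normalizing before comparison matters here because chat models frequently answer with "Yes." or "No" rather than the bare lowercase label.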
race_based_med#
A collection of LLM outputs in response to medical questions with race-based biases, with the objective being to classify whether the output contains racially biased content.
Harness: helm
Container:
nvcr.io/nvidia/eval-factory/helm:26.01
Container Digest:
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
Container Arch: amd
Task Type: race_based_med
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: race_based_med
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
    - chat
  type: race_based_med
target:
  api_endpoint: {}
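One detail of the command template that is easy to miss: the `*_judge_api_key` values in `config.params.extra` are not the keys themselves but the *names* of environment variables, and for each non-null name the template emits `export GPT_JUDGE_API_KEY=$<that name>` (and likewise for the Llama and Claude judges) before invoking `helm-run`. The helper below is a hypothetical Python rendering of that conditional plumbing, included only to make the indirection explicit.

```python
# Sketch of the judge-key export logic in the command template: each
# configured value names a source environment variable whose contents
# are re-exported under the canonical judge variable; null entries are
# skipped.

def export_judge_keys(extra: dict, env: dict) -> None:
    """Copy each configured source variable into its canonical judge key."""
    for param, target_var in [
        ("gpt_judge_api_key", "GPT_JUDGE_API_KEY"),
        ("llama_judge_api_key", "LLAMA_JUDGE_API_KEY"),
        ("claude_judge_api_key", "CLAUDE_JUDGE_API_KEY"),
    ]:
        source_var = extra.get(param)
        if source_var is not None:  # mirrors `{% if ... is not none %}`
            env[target_var] = env.get(source_var, "")

# A judge key stored under a custom variable name is re-exported under
# the canonical name; judges configured as null are skipped entirely:
env = {"MY_GPT_KEY": "sk-example"}
export_judge_keys(
    {"gpt_judge_api_key": "MY_GPT_KEY",
     "llama_judge_api_key": None,
     "claude_judge_api_key": None},
    env,
)
print(env["GPT_JUDGE_API_KEY"])  # sk-example
```

With the defaults shown in the configs above (e.g. `gpt_judge_api_key: GPT_JUDGE_API_KEY`), the export is a self-assignment, so it suffices to set the canonical variables in the environment before launching the container.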