helm#

This page lists every evaluation task available in the helm harness, followed by the command template and default configuration for each task.

| Task | Description |
| --- | --- |
| aci_bench | Extract and structure information from patient-doctor conversations |
| ehr_sql | Given a natural language instruction, generate an SQL query that would be used in clinical research. |
| head_qa | A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019). |
| med_dialog_healthcaremagic | Generate summaries of doctor-patient conversations, healthcaremagic version |
| med_dialog_icliniq | Generate summaries of doctor-patient conversations, icliniq version |
| medbullets | A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025). |
| medcalc_bench | A dataset in which each example consists of a patient note, a question asking for a specific medical value to be computed, and a ground-truth answer (Khandekar et al., 2024). |
| medec | A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025). |
| medhallu | A dataset of PubMed articles and associated questions, where the goal is to classify whether the answer is factual or hallucinated. |
| medi_qa | Retrieve and rank answers based on medical question understanding |
| medication_qa | Answer consumer medication-related questions |
| mtsamples_procedures | Document and extract information about medical procedures |
| mtsamples_replicate | Generate treatment plans based on clinical notes |
| pubmed_qa | A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format). |
| race_based_med | A collection of LLM outputs in response to medical questions with race-based biases, where the goal is to classify whether the output contains racially biased content. |

aci_bench#

Extract and structure information from patient-doctor conversations

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: aci_bench

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: aci_bench
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: aci_bench
target:
  api_endpoint: {}
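
The `--run-entries` argument in the command template above is assembled from the task name, an optional subset, and the model id. The following is a minimal sketch of that assembly in plain Python rather than Jinja. The function name `build_run_entry` and the model id `my-model` are hypothetical and used only for illustration.

```python
from typing import Optional


def build_run_entry(task: str, model_id: str, subset: Optional[str] = None) -> str:
    """Mirror the --run-entries fragment of the command template."""
    # The template emits "subset=<value>," only when subset is not null.
    subset_part = f"subset={subset}," if subset is not None else ""
    return f"{task}:{subset_part}model={model_id}"


# aci_bench leaves subset as null, so only the model is appended.
# "my-model" is a placeholder, not a value taken from this page.
print(build_run_entry("aci_bench", "my-model"))
# aci_bench:model=my-model
```

With the defaults shown for this task, `subset` is null, so the run entry contains only the task name and the model.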

ehr_sql#

Given a natural language instruction, generate an SQL query that would be used in clinical research.

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: ehr_sql

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: ehr_sql
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: ehr_sql
target:
  api_endpoint: {}
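
Most `helm-run` flags in the command template above are wrapped in an `is not none` check, so a null value in the configuration produces no flag at all. The sketch below reproduces that rule in plain Python for the optional flags only (the always-present `--suite`, `-o`, and `--local-path` arguments are left out). The helper name `optional_flags` and the override values are hypothetical, and treating a missing `limit_samples` as null is an assumption.

```python
from typing import Any, Dict, List

# (config key, CLI flag) pairs, in the order the command template lists them.
PARAM_FLAGS = [
    ("limit_samples", "--max-eval-instances"),
    ("parallelism", "-n"),
]
EXTRA_FLAGS = [
    ("num_train_trials", "--num-train-trials"),
    ("data_path", "--data-path"),
    ("num_output_tokens", "--num-output-tokens"),
    ("subject", "--subject"),
    ("condition", "--condition"),
    ("max_length", "--max-length"),
]


def optional_flags(params: Dict[str, Any]) -> List[str]:
    """Return the optional helm-run arguments for a config.params mapping."""
    args: List[str] = []
    extra = params.get("extra") or {}
    for source, table in ((params, PARAM_FLAGS), (extra, EXTRA_FLAGS)):
        for key, flag in table:
            value = source.get(key)  # assumption: a missing key behaves like null
            if value is not None:    # null values emit nothing
                args += [flag, str(value)]
    return args


# Defaults shown for ehr_sql: parallelism is 1 and every extra is null.
ehr_sql_defaults = {
    "parallelism": 1,
    "task": "ehr_sql",
    "extra": {key: None for key, _ in EXTRA_FLAGS},
}
print(optional_flags(ehr_sql_defaults))
# ['-n', '1']

# Hypothetical overrides, chosen only to show flags appearing.
overridden = {
    **ehr_sql_defaults,
    "limit_samples": 50,
    "extra": {**ehr_sql_defaults["extra"], "num_output_tokens": 256},
}
print(optional_flags(overridden))
# ['--max-eval-instances', '50', '-n', '1', '--num-output-tokens', '256']
```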

head_qa#

A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019).

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: head_qa

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: head_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: head_qa
target:
  api_endpoint: {}
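
The command template above begins with conditional `export` statements. Each configured value is the name of an environment variable, and the template prefixes it with `$` so the shell expands it at run time. The sketch below renders that prefix in plain Python, with whitespace simplified relative to the template. The function name `export_prefix` and the variable name `MY_API_KEY` are hypothetical.

```python
from typing import Dict, Optional


def export_prefix(api_key_name: Optional[str],
                  judge_key_names: Dict[str, Optional[str]]) -> str:
    """Render the chain of `export ... &&` statements that precedes helm-run."""
    parts = []
    if api_key_name is not None:
        # "$" is literal here; only {api_key_name} is interpolated.
        parts.append(f"export OPENAI_API_KEY=${api_key_name}")
    for target_var, source_var in judge_key_names.items():
        if source_var is not None:
            parts.append(f"export {target_var}=${source_var}")
    return "".join(part + " && " for part in parts)


# Default judge key names from the configuration above: each judge variable
# is re-exported from an environment variable of the same name.
judge_defaults = {
    "GPT_JUDGE_API_KEY": "GPT_JUDGE_API_KEY",
    "LLAMA_JUDGE_API_KEY": "LLAMA_JUDGE_API_KEY",
    "CLAUDE_JUDGE_API_KEY": "CLAUDE_JUDGE_API_KEY",
}

# MY_API_KEY is a placeholder variable name for the target endpoint.
print(export_prefix("MY_API_KEY", judge_defaults))
```

Because the configuration stores variable names rather than secrets, the key values themselves never appear in the rendered command.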

med_dialog_healthcaremagic#

Generate summaries of doctor-patient conversations, healthcaremagic version

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: med_dialog_healthcaremagic

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: med_dialog
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: healthcaremagic
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: med_dialog_healthcaremagic
target:
  api_endpoint: {}

med_dialog_icliniq#

Generate summaries of doctor-patient conversations, icliniq version

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: med_dialog_icliniq

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: med_dialog
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: icliniq
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: med_dialog_icliniq
target:
  api_endpoint: {}
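
The two med_dialog entries share the same underlying task and differ only in their subset and task type. As a rough illustration, the sketch below flattens the relevant parts of both default configurations and lists the keys whose values differ. The helper `flatten` is hypothetical, and only the fields needed for the comparison are copied from the defaults on this page.

```python
from typing import Any, Dict


def flatten(mapping: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. params.extra.subset."""
    flat: Dict[str, Any] = {}
    for key, value in mapping.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


healthcaremagic = {
    "params": {"task": "med_dialog", "extra": {"subset": "healthcaremagic"}},
    "type": "med_dialog_healthcaremagic",
}
icliniq = {
    "params": {"task": "med_dialog", "extra": {"subset": "icliniq"}},
    "type": "med_dialog_icliniq",
}

a, b = flatten(healthcaremagic), flatten(icliniq)
# Keep only the keys whose values differ between the two variants.
diff = {key: (a[key], b[key]) for key in a if a[key] != b[key]}
for key in sorted(diff):
    print(key, diff[key])
```

Both variants keep `params.task` set to `med_dialog`, so the value passed to `--suite` in the command template is the same for each.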

medbullets#

A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025).

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: medbullets

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medbullets
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: medbullets
target:
  api_endpoint: {}

medcalc_bench#

A dataset in which each example consists of a patient note, a question asking for a specific medical value to be computed, and a ground-truth answer (Khandekar et al., 2024).

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: medcalc_bench

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medcalc_bench
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: medcalc_bench
target:
  api_endpoint: {}

medec#

A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025).

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: medec

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medec
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: medec
target:
  api_endpoint: {}

medhallu#

A dataset of PubMed articles and associated questions, where the goal is to classify whether the answer is factual or hallucinated.

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: medhallu

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medhallu
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: medhallu
target:
  api_endpoint: {}

medi_qa#

Retrieve and rank answers based on medical question understanding

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: medi_qa

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medi_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: medi_qa
target:
  api_endpoint: {}

medication_qa#

Answer consumer medication-related questions

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: medication_qa

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: medication_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: medication_qa
target:
  api_endpoint: {}
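
The `--run-entries` argument in the command template above is built from the task name, an optional `subset`, and the model ID. The following is a minimal Python sketch of that string assembly only; the function name `build_run_entry`, the model ID `my-model`, and the subset value are hypothetical and are not part of the harness.

```python
from typing import Optional


def build_run_entry(task: str, model_id: str, subset: Optional[str] = None) -> str:
    """Mirror the --run-entries expression from the command template."""
    entry = f"{task}:"
    # The template adds "subset=<value>," only when subset is not null.
    if subset is not None:
        entry += f"subset={subset},"
    entry += f"model={model_id}"
    return entry


# With the defaults above (subset: null) and a hypothetical model ID:
print(build_run_entry("medication_qa", "my-model"))
# medication_qa:model=my-model
```

With the defaults shown for this task, `subset` is null, so only the `model=` part follows the task name.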

mtsamples_procedures#

Document and extract information about medical procedures

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: mtsamples_procedures

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: mtsamples_procedures
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: mtsamples_procedures
target:
  api_endpoint: {}
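
In the command template above, most `helm-run` flags are emitted only when the corresponding parameter is not null. The sketch below reproduces that conditional logic in plain Python, assuming a missing key behaves like null; the helper name `optional_flags` and the sample values `10` and `512` are illustrative, and the unconditional arguments (`--run-entries`, `--suite`, `-o`, `--local-path`) are left out.

```python
from typing import Any, Dict, List

# (key under config.params.extra, helm-run flag), in template order
EXTRA_FLAGS = [
    ("num_train_trials", "--num-train-trials"),
    ("data_path", "--data-path"),
    ("num_output_tokens", "--num-output-tokens"),
    ("subject", "--subject"),
    ("condition", "--condition"),
    ("max_length", "--max-length"),
]


def optional_flags(params: Dict[str, Any]) -> List[str]:
    """Return only the flags whose parameter is set (not None)."""
    args: List[str] = []
    if params.get("limit_samples") is not None:
        args += ["--max-eval-instances", str(params["limit_samples"])]
    if params.get("parallelism") is not None:
        args += ["-n", str(params["parallelism"])]
    extra = params.get("extra", {})
    for key, flag in EXTRA_FLAGS:
        if extra.get(key) is not None:
            args += [flag, str(extra[key])]
    return args


# Defaults for this task: parallelism is 1 and every extra value is null.
print(optional_flags({"parallelism": 1, "extra": {"num_output_tokens": None}}))
# ['-n', '1']
```

Setting, for example, `limit_samples` to 10 and `extra.num_output_tokens` to 512 would add `--max-eval-instances 10` and `--num-output-tokens 512` to the list.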

mtsamples_replicate#

Generate treatment plans based on clinical notes

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: mtsamples_replicate

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: mtsamples_replicate
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: mtsamples_replicate
target:
  api_endpoint: {}
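
The command template above begins with a chain of `export` statements. Each configured value is the name of an environment variable, and the template prefixes it with `$` so that the shell expands it at run time. The Python sketch below illustrates the resulting prefix; the function name `export_prefix` and the variable name `MY_API_KEY` are hypothetical, and the exact whitespace differs slightly from what the template renders.

```python
from typing import Dict, Optional


def export_prefix(api_key_name: Optional[str], extra: Dict[str, Optional[str]]) -> str:
    """Build the 'export X=$Y && ' chain for every key name that is set."""
    pairs = [
        ("OPENAI_API_KEY", api_key_name),
        ("GPT_JUDGE_API_KEY", extra.get("gpt_judge_api_key")),
        ("LLAMA_JUDGE_API_KEY", extra.get("llama_judge_api_key")),
        ("CLAUDE_JUDGE_API_KEY", extra.get("claude_judge_api_key")),
    ]
    # "$" is literal here; the shell, not Python, resolves the variable.
    return "".join(
        f"export {target}=${source} && "
        for target, source in pairs
        if source is not None
    )


defaults = {
    "gpt_judge_api_key": "GPT_JUDGE_API_KEY",
    "llama_judge_api_key": "LLAMA_JUDGE_API_KEY",
    "claude_judge_api_key": "CLAUDE_JUDGE_API_KEY",
}
print(export_prefix("MY_API_KEY", defaults))
```

With the defaults shown for this task, each judge key is re-exported from an environment variable of the same name, so those variables need to be set in the environment where the command runs.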

pubmed_qa#

A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format).

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: pubmed_qa

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: pubmed_qa
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: pubmed_qa
target:
  api_endpoint: {}
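
The YAML above lists the task defaults. As a rough illustration of how user-supplied values could be layered over them, the sketch below performs a recursive dictionary merge in Python. How the framework actually merges overrides is an assumption here; `deep_merge` and the override values (`parallelism: 4`, `num_output_tokens: 256`) are hypothetical, and only a few of the default keys are reproduced.

```python
import copy
from typing import Any, Dict

# A subset of the defaults listed above for pubmed_qa.
DEFAULTS: Dict[str, Any] = {
    "config": {
        "params": {
            "parallelism": 1,
            "task": "pubmed_qa",
            "extra": {"num_output_tokens": None, "subset": None},
        },
        "type": "pubmed_qa",
    }
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override values replace base values, key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical overrides: raise parallelism and cap the output length.
overrides = {"config": {"params": {"parallelism": 4, "extra": {"num_output_tokens": 256}}}}
result = deep_merge(DEFAULTS, overrides)
print(result["config"]["params"])
```

Keys that the override does not mention, such as `task` and `extra.subset`, keep their default values.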

race_based_med#

A collection of LLM outputs in response to medical questions with race-based biases. The objective is to classify whether each output contains racially biased content.

Harness: helm

Container:

nvcr.io/nvidia/eval-factory/helm:26.01

Container Digest:

sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589

Container Arch: amd

Task Type: race_based_med

{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs  --model-name {{target.api_endpoint.model_id}}  --base-url {{target.api_endpoint.url}}  --openai-model-name {{target.api_endpoint.model_id}}  --output-dir {{config.output_dir}} && helm-run  --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}}  {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}}  {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}}  {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %}  --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %}  --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %}  --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %}  --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %}  --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %}  --max-length {{config.params.extra.max_length}} {% endif %}  -o {{config.output_dir}}  --local-path {{config.output_dir}}
framework_name: helm
pkg_name: helm
config:
  params:
    parallelism: 1
    task: race_based_med
    extra:
      data_path: null
      num_output_tokens: null
      subject: null
      condition: null
      max_length: null
      num_train_trials: null
      subset: null
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: race_based_med
target:
  api_endpoint: {}
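
Each task on this page lists both a container tag and a container digest. A tag can be moved to a different image, while a digest identifies one exact image, so a digest-pinned reference of the form `repository@sha256:...` is useful for reproducible runs. The Python sketch below combines the two values listed above; the helper name `pinned_reference` is hypothetical.

```python
import re

IMAGE = "nvcr.io/nvidia/eval-factory/helm:26.01"
DIGEST = "sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589"


def pinned_reference(image: str, digest: str) -> str:
    """Replace the tag (if any) with the digest: repository@sha256:<hex>."""
    if not re.fullmatch(r"sha256:[0-9a-f]{64}", digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    # Only treat ':' as a tag separator in the last path component,
    # so a registry port such as host:5000/name is left untouched.
    last_component = image.rsplit("/", 1)[-1]
    repository = image.rsplit(":", 1)[0] if ":" in last_component else image
    return f"{repository}@{digest}"


print(pinned_reference(IMAGE, DIGEST))
```

The resulting string can be passed to a container runtime in place of the tagged image name.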