AA-LCR

This page contains all evaluation tasks for the AA-LCR harness.

Task: aa_lcr

Description: A challenging benchmark measuring language models’ ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

aa_lcr

A challenging benchmark measuring language models’ ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
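To make the token range concrete, here is a minimal Python sketch of how a document's length could be measured with the cl100k_base encoding. It assumes the third-party `tiktoken` package is installed; the helper names `count_cl100k_tokens` and `in_documented_range` are hypothetical, and reading "10k to 100k" as exactly 10,000 and 100,000 is an assumption, since the description gives only rounded bounds. This is not code from the AA-LCR harness.

```python
def count_cl100k_tokens(text: str) -> int:
    """Count tokens in `text` using the cl100k_base encoding."""
    # Imported lazily so the range helper below works without tiktoken installed.
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


def in_documented_range(n_tokens: int, low: int = 10_000, high: int = 100_000) -> bool:
    """Check a token count against the rounded range quoted in the description.

    The exact bounds are an assumption made for illustration.
    """
    return low <= n_tokens <= high
```

A document could then be checked with `in_documented_range(count_cl100k_tokens(document_text))`, where `document_text` is whatever string you want to measure.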

Harness: AA-LCR

Container: nvcr.io/nvidia/eval-factory/aa-lcr:26.01

Container Digest: sha256:67dd35302ed15610afc9471a2ff4f515d95a235753f1b259db60748249366939

Container Arch: multiarch

Task Type: aa_lcr

Command:

aa_lcr --model={{target.api_endpoint.model_id}} --endpoint_url={{target.api_endpoint.url}}  --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --request_timeout={{config.params.request_timeout}}  {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}}  {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}}  --max_new_tokens={{config.params.max_new_tokens}} --async_limit={{config.params.parallelism}} --num_repeats={{config.params.extra.n_samples}} --seed={{config.params.extra.seed}} --judge_model={{config.params.extra.judge.model_id}} --judge_url={{config.params.extra.judge.url}} --judge_temperature={{config.params.extra.judge.temperature}} --judge_top_p={{config.params.extra.judge.top_p}} --judge_max_new_tokens={{config.params.extra.judge.max_new_tokens}} --judge_async_limit={{config.params.extra.judge.parallelism}} {% if config.params.extra.judge.api_key is defined %}--judge_api_key_name={{config.params.extra.judge.api_key}}{% endif %}

Defaults:

framework_name: AA-LCR
pkg_name: aa_lcr
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    temperature: 0.0
    request_timeout: 600
    top_p: 1.0
    extra:
      n_samples: 3
      seed: 42
      judge:
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: nvdev/qwen/qwen-235b
        request_timeout: 600
        max_retries: 30
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        parallelism: 10
        api_key: JUDGE_API_KEY
  supported_endpoint_types:
  - chat
  type: aa_lcr
target:
  api_endpoint: {}
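
The command template and the configuration values on this page fit together as follows. This is a rough Python sketch of the expansion, not the harness's own rendering code (the template itself uses Jinja syntax). The function `build_command` and the example model name, endpoint URL, and output directory are hypothetical, and `limit_samples` is assumed to be unset because it does not appear in the configuration shown here.

```python
import shlex

# Parameter values copied from the configuration on this page.
PARAMS = {
    "max_new_tokens": 16384,
    "max_retries": 30,
    "parallelism": 10,
    "temperature": 0.0,
    "request_timeout": 600,
    "top_p": 1.0,
    "limit_samples": None,  # assumption: unset, so no --limit flag is emitted
    "extra": {
        "n_samples": 3,
        "seed": 42,
        "judge": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "model_id": "nvdev/qwen/qwen-235b",
            "temperature": 0.0,
            "top_p": 1.0,
            "max_new_tokens": 1024,
            "parallelism": 10,
            "api_key": "JUDGE_API_KEY",
        },
    },
}


def build_command(model_id, endpoint_url, output_dir, params=PARAMS, api_key_name=None):
    """Assemble the aa_lcr command line in the same flag order as the template."""
    extra = params["extra"]
    judge = extra["judge"]
    args = [
        "aa_lcr",
        f"--model={model_id}",
        f"--endpoint_url={endpoint_url}",
        f"--temperature={params['temperature']}",
        f"--top_p={params['top_p']}",
        f"--request_timeout={params['request_timeout']}",
    ]
    # Optional flags mirror the template's {% if %} blocks.
    if params.get("limit_samples") is not None:
        args += ["--limit", str(params["limit_samples"])]
    args.append(f"--output_dir={output_dir}")
    if api_key_name is not None:
        args.append(f"--api_key_name={api_key_name}")
    args += [
        f"--max_retries={params['max_retries']}",
        f"--max_new_tokens={params['max_new_tokens']}",
        f"--async_limit={params['parallelism']}",
        f"--num_repeats={extra['n_samples']}",
        f"--seed={extra['seed']}",
        f"--judge_model={judge['model_id']}",
        f"--judge_url={judge['url']}",
        f"--judge_temperature={judge['temperature']}",
        f"--judge_top_p={judge['top_p']}",
        f"--judge_max_new_tokens={judge['max_new_tokens']}",
        f"--judge_async_limit={judge['parallelism']}",
    ]
    if "api_key" in judge:
        # The template passes the value of judge.api_key as the key *name*.
        args.append(f"--judge_api_key_name={judge['api_key']}")
    return shlex.join(args)


if __name__ == "__main__":
    # Hypothetical target endpoint and output directory, for illustration only.
    print(build_command("my-model", "http://localhost:8000/v1/chat/completions", "/tmp/results"))
```

Note that the template forwards the judge's `api_key` value through `--judge_api_key_name`, which suggests the value (`JUDGE_API_KEY` here) names an environment variable rather than holding the secret itself. The judge's `request_timeout` and `max_retries` entries are not referenced by the template, so the sketch does not emit flags for them.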