LM Harness Evaluation Type#

LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks.

Prerequisites#

Target Configuration#

Set up or select an existing evaluation target. All LM Harness evaluations use the same target structure. Here’s an example targeting a NIM endpoint:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

| Field | Description | Required | Default |
|---|---|---|---|
| type | Always "model" for LM Harness evaluations. | Yes | — |
| url | The URL of the API endpoint for the model. | Yes | — |
| model_id | The model identifier. | Yes | — |
| stream | Whether to use streaming responses. | No | false |
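
The optional stream field is not shown in the example above. Assuming it sits alongside url and model_id in api_endpoint, as the field table suggests, a target with streaming disabled explicitly might look like the following sketch (illustrative only):

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct",
        "stream": false
      }
    }
  }
}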


Supported Tasks#

Example LM Harness Tasks by Category#

| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | bbh, musr, gpqa_diamond_cot | Big-Bench Hard, multistep reasoning, and graduate-level Q&A tasks. |
| Instruction Following | ifeval | Tests the ability to follow specific instructions. |
| Language Understanding | mmlu | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | gsm8k | Grade school and advanced math word problems. |
| Multilingual Tasks | mgsm, wikilingua | Math word problems and translation tasks in multiple languages. |

For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run python -m lm_eval --tasks list.
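
Any of the task names above can be used as the configuration type. For instance, a hypothetical configuration for the bbh task, following the same pattern as the examples in the sections below (the name, namespace, and parameter values are illustrative, not recommended settings):

{
  "type": "bbh",
  "name": "my-configuration-lm-harness-bbh-1",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here"
    }
  }
}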


Advanced Reasoning (GPQA)#

Example configuration:

{
  "type": "gpqa_diamond_cot",
  "name": "my-configuration-lm-harness-gpqa-diamond-cot-1",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here",
    }
  }
}

Example data and model output:

{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}

Example results:

{
  "tasks": {
    "gpqa_diamond_cot_zeroshot": {
      "metrics": {
        "exact_match__flexible-extract": {
          "scores": {
            "exact_match__flexible-extract": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Instruction Following (IFEval)#

Example configuration:

{
  "type": "ifeval",
  "name": "my-configuration-lm-harness-ifeval-1",
  "namespace": "my-organization",
  "params": {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50,
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 1024,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}

Example data and model output:

{
  "prompt": "Write a short story about a cat. The story must be exactly 3 sentences long.",
  "instruction_id_list": ["length_constraints:number_sentences"],
  "kwargs": [{"num_sentences": 3}],
  "output": "The cat sat by the window. It watched the birds outside. Then it fell asleep in the warm sunlight."
}

Example results:

{
  "tasks": {
    "ifeval": {
      "metrics": {
        "prompt_level_strict_acc": {
          "scores": {
            "prompt_level_strict_acc": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Language Understanding (MMLU)#

Example configuration:

{
  "type": "mmlu",
  "name": "my-configuration-lm-harness-mmlu-1",
  "namespace": "my-organization",
  "params": {
    "extra": {
      "model_type": "completions",
      "num_fewshot": 5,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}

Example data and model output:

{
  "question": "Which of the following is a prime number?",
  "choices": ["4", "6", "7", "8"],
  "answer": "7",
  "output": "7"
}

Example results:

{
  "tasks": {
    "mmlu_abstract_algebra": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Math & Reasoning (GSM8K)#

Example configuration:

{
  "type": "gsm8k",
  "name": "my-configuration-lm-harness-gsm8k-1",
  "namespace": "my-organization",
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 256,
    "parallelism": 10,
    "extra": {
      "model_type": "completions",
      "num_fewshot": 8,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}

Example data and model output:

{
  "question": "If you have 3 apples and you get 2 more, how many apples do you have?",
  "answer": "5",
  "output": "5"
}

Example results:

{
  "tasks": {
    "gsm8k": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}



Parameters#

Request Parameters#

These parameters control how requests are made to the model:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| max_retries | Maximum number of retries for failed requests. | Integer | Container default | — |
| parallelism | Number of parallel requests to improve throughput. | Integer | Container default | — |
| request_timeout | Timeout in seconds for each request. | Integer | Container default | — |
| limit_samples | Limit the number of samples to evaluate. Useful for testing. | Integer | null (all samples) | — |
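
As a minimal sketch, the request parameters above go directly under params in a configuration, as in the IFEval example earlier on this page; the values here are arbitrary and chosen only for illustration:

{
  "params": {
    "max_retries": 3,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50
  }
}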

Model Parameters#

These parameters control the model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| temperature | Sampling temperature for generation. | Float | Container default | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.01 | 0.01–1.0 |
| max_tokens | Maximum number of tokens to generate. | Integer | Container default | — |
| stop | Up to 4 sequences where the API will stop generating further tokens. | Array of strings | — | — |
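
The stop parameter does not appear in the task examples on this page. A hedged sketch of a params block that sets all four model parameters, assuming stop is passed as a JSON array of strings alongside the other generation settings (values are illustrative):

{
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 512,
    "stop": ["</s>"]
  }
}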

Extra Parameters#

Set these parameters in the params.extra section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| model_type | Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess if not provided. | String | Autodetected | "completions", "chat", "vlm" |
| hf_token | HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace. | String | — | Valid HF token |
| tokenizer | Path to the tokenizer model. If missing, will attempt to use the target model from HuggingFace. | String | Target model name | HuggingFace model path |
| tokenizer_backend | System for loading the tokenizer. | String | "huggingface" | "huggingface", "tiktoken" |
| num_fewshot | Number of examples in few-shot context. | Integer | Task-dependent | — |
| tokenized_requests | Whether to use tokenized requests. | Boolean | false | true, false |
| downsampling_ratio | Ratio for downsampling the dataset. | Float | null (no downsampling) | 0.0–1.0 |
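
The tokenized_requests and downsampling_ratio fields likewise do not appear in the task examples above. An illustrative params.extra fragment that sets them (the downsampling value is arbitrary):

{
  "params": {
    "extra": {
      "model_type": "completions",
      "hf_token": "hf_your_token_here",
      "tokenized_requests": false,
      "downsampling_ratio": 0.1
    }
  }
}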

Important Notes#

  • model_type: Different tasks support different model types. Tasks that support both "completions" and "chat" default to "chat"; if no preference is detected, the default is "completions".

  • hf_token: Required for tasks that fetch datasets or tokenizers from HuggingFace. If the token is missing or lacks the necessary permissions, errors appear in the run logs.

  • tokenizer: Some tasks require a tokenizer. NVIDIA internal model names are often lowercase, while the corresponding HuggingFace models use different casing; specify the tokenizer explicitly (as in the MMLU and GSM8K configurations above) to avoid lookup failures.

Metrics#

Core Supported Metrics in LM Evaluation Harness#

| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| acc | Accuracy (fraction of correct predictions) | 0–1 | Most common for classification tasks |
| acc_norm | Length-normalized accuracy | 0–1 | Normalizes for answer length |
| acc_mutual_info | Baseline log-likelihood normalized accuracy | Task-dependent | Used in some specialized tasks |
| perplexity | Perplexity (measure of model uncertainty) | >0 | Lower is better |
| word_perplexity | Perplexity per word | >0 | Lower is better |
| byte_perplexity | Perplexity per byte | >0 | Lower is better |
| bits_per_byte | Bits per byte | >0 | Lower is better |
| matthews_corrcoef | Matthews correlation coefficient | -1 to 1 | For binary/multiclass classification |
| f1 | F1 score (harmonic mean of precision and recall) | 0–1 | For classification/QA tasks |
| bleu | BLEU score (text generation quality) | 0–100 | For translation/generation tasks |
| chrf | Character F-score (CHRF) | 0–100 | For translation/generation tasks |
| ter | Translation Edit Rate (TER) | 0–1 | For translation tasks |
| prompt_level_strict_acc | Prompt-level strict accuracy for instruction following | 0–1 | For instruction following tasks like IFEval |
| pass@1 | Pass rate for code generation (first attempt) | 0–1 | For code generation tasks |

Not all metrics are available for every task. Check the task definition for the exact metrics used.
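
Metric names surface in results under the same nesting shown in the task examples above. For instance, a purely illustrative results fragment for a hypothetical run reporting acc_norm (the task name and value are placeholders, not real output):

{
  "tasks": {
    "example_task": {
      "metrics": {
        "acc_norm": {
          "scores": {
            "acc_norm": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}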