LM Harness Evaluation Type#
LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and HellaSwag. Use this evaluation type to benchmark general language understanding and reasoning tasks.
Prerequisites#
Set up or select an existing evaluation target.
Supported Tasks#
| Category | Example Task(s) | Description |
|---|---|---|
| Language Understanding | `mmlu`, `mmlu_abstract_algebra` | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | `gsm8k_cot_llama`, `gpqa_diamond_generative_n_shot` | Grade school and advanced math word problems. |
| Commonsense Reasoning | `hellaswag` | Tests for everyday reasoning and social intelligence. |
| Multiple Choice QA | `arc_challenge` | Standardized test-style multiple choice questions. |
| Reading Comprehension | `squad_completion` | Answer extraction from passages. |
| Code Generation | `humaneval` | Write code to solve programming problems. |
| Translation | `wmt2016_en_de` | Machine translation between languages. |
| Ethics & Truthfulness | `truthfulqa_mc` | Measures model truthfulness and ethical reasoning. |
| Winograd/Disambiguation | `winogrande` | Tests for pronoun resolution and ambiguity. |
| Story/Completion | `lambada` | Predicts the next word or sentence in a story. |
For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run `python -m lm_eval --tasks list`.
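If you want to check the available task names from a script before writing a configuration, you can call the same CLI command programmatically. A minimal sketch, assuming lm-evaluation-harness is installed in the current Python environment:

```python
# Minimal sketch: print the task names exposed by a locally installed
# lm-evaluation-harness, by invoking the documented CLI entry point.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "lm_eval", "--tasks", "list"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```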
Math & Reasoning#
GPQA Example#
{
"type": "gpqa",
"name": "my-configuration-lm-harness-gpqa-1",
"namespace": "my-organization",
"tasks": {
"gpqa_diamond_generative_n_shot": {
"type": "gpqa_diamond_generative_n_shot"
}
},
"params": {
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.0,
"stop": [
"<|endoftext|>",
"<extra_id_1>"
],
"extra": {
"use_greedy": true,
"top_k": 1
}
}
}
{
"question": "What is the capital of France?",
"choices": ["Paris", "London", "Berlin", "Madrid"],
"answer": "Paris",
"output": "Paris"
}
{
"tasks": {
"gpqa_diamond_generative_n_shot": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
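The three JSON blocks above show, in order, the evaluation configuration, a sample data record, and the returned metrics. To illustrate how a generative accuracy score like this one is an aggregate over per-sample records, here is a simplified, hypothetical sketch; the `answer` and `output` field names follow the data example, and the real harness applies task-specific answer extraction and normalization before comparing:

```python
# Hypothetical sketch: exact-match accuracy over records shaped like the
# data example above. The actual harness normalizes and extracts answers
# in a task-specific way before comparison.
records = [
    {"answer": "Paris", "output": "Paris"},
    {"answer": "Berlin", "output": "Paris"},
]

correct = sum(
    r["output"].strip().lower() == r["answer"].strip().lower() for r in records
)
print(f"accuracy: {correct / len(records):.2f}")  # 0.50 for this toy input
```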
GSM8K Example#
{
"type": "gsm8k",
"name": "my-configuration-lm-harness-gsm8k-1",
"namespace": "my-organization",
"tasks": {
"gsm8k_cot_llama": {
"type": "gsm8k_cot_llama"
}
},
"params": {
"temperature": 0.00001,
"top_p": 0.00001,
"max_tokens": 256,
"stop": ["<|eot|>"],
"extra": {
"num_fewshot": 8,
"batch_size": 16,
"bootstrap_iters": 100000,
"dataset_seed": 42,
"use_greedy": true,
"top_k": 1,
"hf_token": "<my-token>",
"tokenizer_backend": "hf",
"tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
"apply_chat_template": true,
"fewshot_as_multiturn": true
}
}
}
{
"question": "If you have 3 apples and you get 2 more, how many apples do you have?",
"answer": "5",
"output": "5"
}
{
"tasks": {
"gsm8k_cot_llama": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
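If you need several variants of a configuration like this one, for example to sweep `num_fewshot`, it can be convenient to build the JSON programmatically. A minimal sketch; `make_gsm8k_config` is a hypothetical helper, not part of any API, and it only reuses fields shown in the example above:

```python
import json

def make_gsm8k_config(name: str, num_fewshot: int = 8, batch_size: int = 16) -> dict:
    """Hypothetical helper: build a config shaped like the GSM8K example above."""
    return {
        "type": "gsm8k",
        "name": name,
        "namespace": "my-organization",
        "tasks": {"gsm8k_cot_llama": {"type": "gsm8k_cot_llama"}},
        "params": {
            "temperature": 0.00001,
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "extra": {
                "num_fewshot": num_fewshot,
                "batch_size": batch_size,
                "use_greedy": True,
                "top_k": 1,
                "apply_chat_template": True,
                "fewshot_as_multiturn": True,
            },
        },
    }

print(json.dumps(make_gsm8k_config("my-configuration-lm-harness-gsm8k-1"), indent=2))
```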
Options#
Language Understanding (MMLU)#
{
"type": "mmlu",
"name": "my-configuration-lm-harness-mmlu-1",
"namespace": "my-organization",
"tasks": {
"mmlu_abstract_algebra": {
"type": "mmlu_abstract_algebra"
}
},
"params": {
"num_fewshot": 5,
"batch_size": 8
}
}
{
"question": "Which of the following is a prime number?",
"choices": ["4", "6", "7", "8"],
"answer": "7",
"output": "7"
}
{
"tasks": {
"mmlu_abstract_algebra": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
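MMLU is split into 57 per-subject subtasks, so a single configuration can evaluate several subjects by listing more entries under `tasks`. A small sketch; the subject names follow the harness's `mmlu_<subject>` naming convention:

```python
# Sketch: build a `tasks` section that covers several MMLU subjects.
subjects = ["abstract_algebra", "anatomy", "astronomy"]

tasks = {f"mmlu_{s}": {"type": f"mmlu_{s}"} for s in subjects}
print(tasks)
# {'mmlu_abstract_algebra': {'type': 'mmlu_abstract_algebra'}, ...}
```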
Commonsense Reasoning (HellaSwag)#
{
"type": "hellaswag",
"name": "my-configuration-lm-harness-hellaswag-1",
"namespace": "my-organization",
"tasks": {
"hellaswag": {
"type": "hellaswag"
}
},
"params": {
"num_fewshot": 10,
"batch_size": 8
}
}
{
"ctx_a": "On stage, a woman takes a seat at the piano. She",
"endings": [
"sits on a bench as her sister plays with the doll.",
"smiles with joy as she looks at the audience.",
"nervously sets her fingers on the keys.",
"is in the crowd, watching the dancers."
],
"label": 2,
"output": "nervously sets her fingers on the keys."
}
{
"tasks": {
"hellaswag": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
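HellaSwag is scored by comparing the model's loglikelihood for each candidate ending, and the Metrics section below lists both plain and length-normalized accuracy. A conceptual sketch of the difference, using made-up loglikelihood values (the harness computes these from the model and, for `acc_norm`, normalizes by the length of the continuation):

```python
# Conceptual sketch: pick an ending by raw loglikelihood (acc) versus by
# length-normalized loglikelihood (acc_norm). Loglikelihoods are made up.
endings = [
    "sits on a bench as her sister plays with the doll.",
    "smiles with joy as she looks at the audience.",
    "nervously sets her fingers on the keys.",
    "is in the crowd, watching the dancers.",
]
loglikelihoods = [-31.2, -26.0, -25.1, -28.4]

pred_acc = max(range(4), key=lambda i: loglikelihoods[i])
pred_acc_norm = max(range(4), key=lambda i: loglikelihoods[i] / len(endings[i]))

print(pred_acc, pred_acc_norm)  # 2 and 1: the two rules can disagree
```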
Multiple Choice QA (ARC)#
{
"type": "arc",
"name": "my-configuration-lm-harness-arc-1",
"namespace": "my-organization",
"tasks": {
"arc_challenge": {
"type": "arc_challenge"
}
},
"params": {
"num_fewshot": 25,
"batch_size": 8
}
}
{
"question": "What is the boiling point of water?",
"choices": ["90°C", "100°C", "110°C", "120°C"],
"answer": "100°C",
"output": "100°C"
}
{
"tasks": {
"arc_challenge": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
Reading Comprehension (SQuAD Completion)#
{
"type": "squad_completion",
"name": "my-configuration-lm-harness-squad-1",
"namespace": "my-organization",
"tasks": {
"squad_completion": {
"type": "squad_completion"
}
},
"params": {
"num_fewshot": 3,
"batch_size": 8
}
}
{
"context": "The Eiffel Tower is located in Paris.",
"question": "Where is the Eiffel Tower located?",
"answers": ["Paris"],
"output": "Paris"
}
{
"tasks": {
"squad_completion": {
"metrics": {
"f1": {
"scores": {
"f1": {
"value": 1.0
}
}
}
}
}
}
}
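The F1 score reported for extractive QA is typically a token-overlap F1 between the predicted answer and the reference answer. A simplified sketch of the computation; real SQuAD-style scorers also strip punctuation and articles before comparing:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1; real scorers normalize text first."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris", "Paris"))             # 1.0, matching the results example
print(token_f1("the city of Paris", "Paris")) # 0.4, partial credit
```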
Code Generation (HumanEval)#
{
"type": "humaneval",
"name": "my-configuration-lm-harness-humaneval-1",
"namespace": "my-organization",
"tasks": {
"humaneval": {
"type": "humaneval"
}
},
"params": {
"num_fewshot": 0,
"batch_size": 4
}
}
{
"prompt": "def add(a, b):\n ",
"test": "assert add(2, 3) == 5",
"output": "return a + b"
}
{
"tasks": {
"humaneval": {
"metrics": {
"pass@1": {
"scores": {
"pass@1": {
"value": 1.0
}
}
}
}
}
}
}
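`pass@1` is the fraction of problems for which a generated completion passes the unit tests. When several completions are sampled per problem, the unbiased pass@k estimator from the HumanEval paper is the standard way to aggregate them; a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per problem, c of them passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3: expected pass rate of a single sample
```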
Translation (WMT2016)#
{
"type": "wmt2016",
"name": "my-configuration-lm-harness-wmt2016-1",
"namespace": "my-organization",
"tasks": {
"wmt2016_en_de": {
"type": "wmt2016_en_de"
}
},
"params": {
"num_fewshot": 0,
"batch_size": 8
}
}
{
"source": "The cat sits on the mat.",
"target": "Die Katze sitzt auf der Matte.",
"output": "Die Katze sitzt auf der Matte."
}
{
"tasks": {
"wmt2016_en_de": {
"metrics": {
"bleu": {
"scores": {
"bleu": {
"value": 100.0
}
}
}
}
}
}
}
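The BLEU score in the results is a corpus-level metric on a 0–100 scale. If you want to sanity-check a score offline, the widely used sacrebleu package can compute it; a sketch, assuming sacrebleu is installed (it is not required by this evaluation type itself):

```python
import sacrebleu

hypotheses = ["Die Katze sitzt auf der Matte."]
references = [["Die Katze sitzt auf der Matte."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # 100.0 for an exact match, as in the example above
```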
Ethics & Truthfulness (TruthfulQA)#
{
"type": "truthfulqa",
"name": "my-configuration-lm-harness-truthfulqa-1",
"namespace": "my-organization",
"tasks": {
"truthfulqa_mc": {
"type": "truthfulqa_mc"
}
},
"params": {
"num_fewshot": 0,
"batch_size": 8
}
}
{
"question": "What is the boiling point of water?",
"choices": ["90°C", "100°C", "110°C", "120°C"],
"answer": "100°C",
"output": "100°C"
}
{
"tasks": {
"truthfulqa_mc": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
Disambiguation (Winogrande)#
{
"type": "winogrande",
"name": "my-configuration-lm-harness-winogrande-1",
"namespace": "my-organization",
"tasks": {
"winogrande": {
"type": "winogrande"
}
},
"params": {
"num_fewshot": 5,
"batch_size": 8
}
}
{
"sentence": "The trophy would not fit in the brown suitcase because it was too big. What was too big?",
"options": ["the trophy", "the suitcase"],
"answer": "the trophy",
"output": "the trophy"
}
{
"tasks": {
"winogrande": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
Story/Completion (LAMBADA)#
{
"type": "lambada",
"name": "my-configuration-lm-harness-lambada-1",
"namespace": "my-organization",
"tasks": {
"lambada": {
"type": "lambada"
}
},
"params": {
"num_fewshot": 0,
"batch_size": 8
}
}
{
"context": "She opened the door and saw a ",
"target": "cat",
"output": "cat"
}
{
"tasks": {
"lambada": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
For the full list of LM Harness tasks, refer to the lm-evaluation-harness tasks directory.
Parameters#
You can set the following task-specific parameters in the `params.extra` section of an LM Harness config:
| Name | Description | Type | Valid Values or Child Objects |
|---|---|---|---|
| `apply_chat_template` | Specify whether to apply a chat template to the prompt. You can specify `true` to apply the tokenizer's default chat template, or the name of a specific template. | Boolean or String | `true`, `false`, or a chat template name |
| `batch_size` | The batch size for the model. | Integer | — |
| `bootstrap_iters` | The number of iterations for bootstrap statistics when calculating stderrs. Specify 0 for no stderr calculations. | Integer | — |
| `dataset_seed` | A random seed for dataset shuffling. | Integer | — |
| `fewshot_as_multiturn` | Specify whether to provide the few-shot examples as a multi-turn conversation. Typically used together with `apply_chat_template`. | Boolean | `true`, `false` |
| `hf_token` | A Hugging Face account token to access tokenizers that require authenticated or authorized access. | String | — |
| `num_fewshot` | The number of examples in the few-shot context. | Integer | — |
| `seed` | A random seed for Python's `random`, `numpy`, and `torch`. Accepts a comma-separated list of 3 values for the `random`, `numpy`, and `torch` seeds, respectively. Specify a single integer to set the same seed for all three, for example `42`. | Integer or String | — |
| `tokenizer` | The name or path of a custom tokenizer to use for the benchmark. | String | — |
| `tokenizer_backend` | The backend to use for loading the tokenizer. | String | `hf` |
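As an illustration of how these options combine, here is a hedged sketch of a `params` block expressed as a Python dict; all values are placeholders, and `<my-token>` stands in for a real Hugging Face token:

```python
# Sketch of a params block combining several of the options above.
# All values are placeholders chosen for illustration.
params = {
    "max_tokens": 256,
    "extra": {
        "num_fewshot": 5,
        "batch_size": 8,
        "bootstrap_iters": 0,          # 0 disables stderr calculations
        "dataset_seed": 42,
        "seed": "42,1234,1234",        # random, numpy, torch seeds
        "apply_chat_template": True,
        "fewshot_as_multiturn": True,
        "tokenizer_backend": "hf",
        "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
        "hf_token": "<my-token>",
    },
}
```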
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `acc` | Accuracy (fraction of correct predictions) | 0–1 | Most common for classification tasks |
| `acc_norm` | Length-normalized accuracy | 0–1 | Normalizes for answer length |
| `acc_mutual_info` | Baseline loglikelihood-normalized accuracy | Task-dependent | Used in some specialized tasks |
| `perplexity` | Perplexity (measure of model uncertainty) | ≥ 1 | Lower is better |
| `word_perplexity` | Perplexity per word | ≥ 1 | Lower is better |
| `byte_perplexity` | Perplexity per byte | ≥ 1 | Lower is better |
| `bits_per_byte` | Bits per byte | ≥ 0 | Lower is better |
| `mcc` | Matthews correlation coefficient | −1 to 1 | For binary/multiclass classification |
| `f1` | F1 score (harmonic mean of precision and recall) | 0–1 | For classification/QA tasks |
| `bleu` | BLEU score (text generation quality) | 0–100 | For translation/generation tasks |
| `chrf` | Character F-score (chrF) | 0–100 | For translation/generation tasks |
| `ter` | Translation Edit Rate (TER) | ≥ 0 | For translation tasks |
Not all metrics are available for every task. Check the task definition for the exact metrics used.
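The perplexity-family metrics are all derived from the model's total loglikelihood over the evaluation text, so they move together. A sketch of the relationships with toy numbers (natural-log loglikelihood):

```python
import math

# Toy numbers: total natural-log likelihood of the evaluation text,
# plus its length in words and in bytes.
total_loglikelihood = -4500.0
num_words = 1000
num_bytes = 5200

word_perplexity = math.exp(-total_loglikelihood / num_words)
byte_perplexity = math.exp(-total_loglikelihood / num_bytes)
bits_per_byte = -total_loglikelihood / (num_bytes * math.log(2))

# bits_per_byte is log2 of byte_perplexity, which is why both are
# "lower is better" in the same direction.
assert abs(bits_per_byte - math.log2(byte_perplexity)) < 1e-9
print(word_perplexity, byte_perplexity, bits_per_byte)
```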