LM Harness Evaluation Type#

LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and HellaSwag. Use this evaluation type to benchmark general language understanding and reasoning tasks.

Prerequisites#

Supported Tasks#

Example LM Harness Tasks by Category#

| Category | Example Task(s) | Description |
|---|---|---|
| Language Understanding | mmlu | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | gsm8k, mathqa | Grade school and advanced math word problems. |
| Commonsense Reasoning | hellaswag, siqa | Tests for everyday reasoning and social intelligence. |
| Multiple Choice QA | arc, openbookqa, race | Standardized test-style multiple choice questions. |
| Reading Comprehension | squad_completion, drop | Answer extraction from passages. |
| Code Generation | humaneval, mbpp | Write code to solve programming problems. |
| Translation | wmt2016, belebele | Machine translation between languages. |
| Ethics & Truthfulness | truthfulqa, hendrycks_ethics | Measures model truthfulness and ethical reasoning. |
| Winograd/Disambiguation | winogrande, wsc273 | Tests for pronoun resolution and ambiguity. |
| Story/Completion | lambada, storycloze | Predicts the next word or sentence in a story. |

For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run python -m lm_eval --tasks list.
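If you have lm-evaluation-harness installed locally, you can also enumerate task names from Python. The sketch below is illustrative only: it assumes a v0.4+ installation, and the TaskManager API may differ between releases (the lm_eval --tasks list command is the stable route).

from lm_eval.tasks import TaskManager  # assumes lm-evaluation-harness v0.4+

# Collect every registered task and group name, then filter for a benchmark family.
task_manager = TaskManager()
gsm8k_variants = [name for name in task_manager.all_tasks if name.startswith("gsm8k")]
print(gsm8k_variants)  # e.g. ["gsm8k", "gsm8k_cot", "gsm8k_cot_llama", ...]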

Math & Reasoning#

Each example below shows an evaluation configuration, followed by a sample dataset record and the corresponding evaluation results.

GPQA Example#

{
    "type": "gpqa",
    "name": "my-configuration-lm-harness-gpqa-1",
    "namespace": "my-organization",
    "tasks": {
        "gpqa_diamond_generative_n_shot": {
            "type": "gpqa_diamond_generative_n_shot"
        }
    },
    "params": {
        "max_tokens": 1024,
        "temperature": 1.0,
        "top_p": 0.0,
        "stop": [
            "<|endoftext|>",
            "<extra_id_1>"
        ],
        "extra": {
            "use_greedy": true,
            "top_k": 1
        }
    }
}
{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}
{
  "tasks": {
    "gpqa_diamond_generative_n_shot": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

GSM8K Example#

{
    "type": "gsm8k",
    "name": "my-configuration-lm-harness-gsm8k-1",
    "namespace": "my-organization",
    "tasks": {
        "gsm8k_cot_llama": {
            "type": "gsm8k_cot_llama"
        }
    },
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": true,
            "top_k": 1,
            "hf_token": "<my-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
            "apply_chat_template": true,
            "fewshot_as_multiturn": true
        }
    }
}
{
  "question": "If you have 3 apples and you get 2 more, how many apples do you have?",
  "answer": "5",
  "output": "5"
}
{
  "tasks": {
    "gsm8k_cot_llama": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Options#

Language Understanding (MMLU)#

{
    "type": "mmlu",
    "name": "my-configuration-lm-harness-mmlu-1",
    "namespace": "my-organization",
    "tasks": {
        "mmlu_abstract_algebra": {
            "type": "mmlu_abstract_algebra"
        }
    },
    "params": {
        "num_fewshot": 5,
        "batch_size": 8
    }
}
{
  "question": "Which of the following is a prime number?",
  "choices": ["4", "6", "7", "8"],
  "answer": "7",
  "output": "7"
}
{
  "tasks": {
    "mmlu_abstract_algebra": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Commonsense Reasoning (HellaSwag)#

{
    "type": "hellaswag",
    "name": "my-configuration-lm-harness-hellaswag-1",
    "namespace": "my-organization",
    "tasks": {
        "hellaswag": {
            "type": "hellaswag"
        }
    },
    "params": {
        "num_fewshot": 10,
        "batch_size": 8
    }
}
{
  "ctx_a": "On stage, a woman takes a seat at the piano. She",
  "endings": [
    "sits on a bench as her sister plays with the doll.",
    "smiles with joy as she looks at the audience.",
    "nervously sets her fingers on the keys.",
    "is in the crowd, watching the dancers."
  ],
  "label": 2,
  "output": "nervously sets her fingers on the keys."
}
{
  "tasks": {
    "hellaswag": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Multiple Choice QA (ARC)#

{
    "type": "arc",
    "name": "my-configuration-lm-harness-arc-1",
    "namespace": "my-organization",
    "tasks": {
        "arc_challenge": {
            "type": "arc_challenge"
        }
    },
    "params": {
        "num_fewshot": 25,
        "batch_size": 8
    }
}
{
  "question": "What is the boiling point of water?",
  "choices": ["90°C", "100°C", "110°C", "120°C"],
  "answer": "100°C",
  "output": "100°C"
}
{
  "tasks": {
    "arc_challenge": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Reading Comprehension (SQuAD Completion)#

{
    "type": "squad_completion",
    "name": "my-configuration-lm-harness-squad-1",
    "namespace": "my-organization",
    "tasks": {
        "squad_completion": {
            "type": "squad_completion"
        }
    },
    "params": {
        "num_fewshot": 3,
        "batch_size": 8
    }
}
{
  "context": "The Eiffel Tower is located in Paris.",
  "question": "Where is the Eiffel Tower located?",
  "answers": ["Paris"],
  "output": "Paris"
}
{
  "tasks": {
    "squad_completion": {
      "metrics": {
        "f1": {
          "scores": {
            "f1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Code Generation (HumanEval)#

{
    "type": "humaneval",
    "name": "my-configuration-lm-harness-humaneval-1",
    "namespace": "my-organization",
    "tasks": {
        "humaneval": {
            "type": "humaneval"
        }
    },
    "params": {
        "num_fewshot": 0,
        "batch_size": 4
    }
}
{
  "prompt": "def add(a, b):\n    ",
  "test": "assert add(2, 3) == 5",
  "output": "return a + b"
}
{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

WMT2016 Example (Translation)#

{
    "type": "wmt2016",
    "name": "my-configuration-lm-harness-wmt2016-1",
    "namespace": "my-organization",
    "tasks": {
        "wmt2016_en_de": {
            "type": "wmt2016_en_de"
        }
    },
    "params": {
        "num_fewshot": 0,
        "batch_size": 8
    }
}
{
  "source": "The cat sits on the mat.",
  "target": "Die Katze sitzt auf der Matte.",
  "output": "Die Katze sitzt auf der Matte."
}
{
  "tasks": {
    "wmt2016_en_de": {
      "metrics": {
        "bleu": {
          "scores": {
            "bleu": {
              "value": 100.0
            }
          }
        }
      }
    }
  }
}

Ethics & Truthfulness (TruthfulQA)#

{
    "type": "truthfulqa",
    "name": "my-configuration-lm-harness-truthfulqa-1",
    "namespace": "my-organization",
    "tasks": {
        "truthfulqa_mc": {
            "type": "truthfulqa_mc"
        }
    },
    "params": {
        "num_fewshot": 0,
        "batch_size": 8
    }
}
{
  "question": "What is the boiling point of water?",
  "choices": ["90°C", "100°C", "110°C", "120°C"],
  "answer": "100°C",
  "output": "100°C"
}
{
  "tasks": {
    "truthfulqa_mc": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Disambiguation (Winogrande)#

{
    "type": "winogrande",
    "name": "my-configuration-lm-harness-winogrande-1",
    "namespace": "my-organization",
    "tasks": {
        "winogrande": {
            "type": "winogrande"
        }
    },
    "params": {
        "num_fewshot": 5,
        "batch_size": 8
    }
}
{
  "sentence": "The trophy would not fit in the brown suitcase because it was too big. What was too big?",
  "options": ["the trophy", "the suitcase"],
  "answer": "the trophy",
  "output": "the trophy"
}
{
  "tasks": {
    "winogrande": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Story/Completion (LAMBADA)#

{
    "type": "lambada",
    "name": "my-configuration-lm-harness-lambada-1",
    "namespace": "my-organization",
    "tasks": {
        "lambada": {
            "type": "lambada"
        }
    },
    "params": {
        "num_fewshot": 0,
        "batch_size": 8
    }
}
{
  "context": "She opened the door and saw a ",
  "target": "cat",
  "output": "cat"
}
{
  "tasks": {
    "lambada": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

For the full list of LM Harness tasks, refer to the lm-evaluation-harness tasks directory.


Parameters#

You can set the following task-specific parameters in the params.extra section of an LM Harness config:

| Name | Description | Type | Valid Values or Child Objects |
|---|---|---|---|
| apply_chat_template | Specifies whether to apply a chat template to the prompt. Specify true to apply the default chat template, false to apply no chat template, or the name of a specific chat template. | Boolean or String | true, false, or a chat template name |
| batch_size | The batch size for the model. | Integer | |
| bootstrap_iters | The number of iterations for bootstrap statistics when calculating standard errors. Specify 0 for no standard error calculations. | Integer | |
| dataset_seed | A random seed for dataset shuffling. | Integer | |
| fewshot_as_multiturn | Specify true to provide the few-shot examples as a multi-turn conversation, or false to provide them in a single user turn. | Boolean | true, false |
| hf_token | A Hugging Face account token for tokenizers that require authenticated or authorized access. | String | |
| num_fewshot | The number of examples in the few-shot context. | Integer | |
| seed | A random seed for Python's random, numpy, and torch. Accepts a comma-separated list of three values for the random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three (for example, 18). Each value is either an integer or None to leave that seed unset (for example, 0,None,8). For backward compatibility, the default is 0,1234,1234. | Integer or String | |
| tokenizer | The path to a custom tokenizer to use for the benchmark. | String | |
| tokenizer_backend | The backend to use for loading the tokenizer. | String | hf for tokenizers from Hugging Face, or tiktoken (default) |
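Most of these fields mirror options of lm-evaluation-harness itself. As a rough, non-authoritative illustration, the sketch below shows how the params.extra values from the GSM8K example above would map onto a direct lm_eval run in Python; it assumes lm-evaluation-harness v0.4+, uses the Llama-3.1-8B-Instruct checkpoint purely as an example, and exact keyword names can vary by release.

import lm_eval  # assumes lm-evaluation-harness v0.4+; keyword names can vary by release

# Rough local equivalent of the params.extra block in the GSM8K example above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # illustrative model only
    tasks=["gsm8k_cot_llama"],
    num_fewshot=8,               # params.extra.num_fewshot
    batch_size=16,               # params.extra.batch_size
    bootstrap_iters=100000,      # params.extra.bootstrap_iters
    apply_chat_template=True,    # params.extra.apply_chat_template
    fewshot_as_multiturn=True,   # params.extra.fewshot_as_multiturn
)
print(results["results"]["gsm8k_cot_llama"])  # per-metric scores for the task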

Metrics#

Core Supported Metrics in LM Evaluation Harness#

| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| acc | Accuracy (fraction of correct predictions) | 0–1 | Most common for classification tasks |
| acc_norm | Length-normalized accuracy | 0–1 | Normalizes for answer length |
| acc_mutual_info | Baseline log-likelihood normalized accuracy | Task-dependent | Used in some specialized tasks |
| perplexity | Perplexity (measure of model uncertainty) | > 0 | Lower is better |
| word_perplexity | Perplexity per word | > 0 | Lower is better |
| byte_perplexity | Perplexity per byte | > 0 | Lower is better |
| bits_per_byte | Bits per byte | > 0 | Lower is better |
| matthews_corrcoef | Matthews correlation coefficient | -1 to 1 | For binary/multiclass classification |
| f1 | F1 score (harmonic mean of precision and recall) | 0–1 | For classification/QA tasks |
| bleu | BLEU score (text generation quality) | 0–100 | For translation/generation tasks |
| chrf | Character F-score (chrF) | 0–100 | For translation/generation tasks |
| ter | Translation Edit Rate (TER) | 0–1 | For translation tasks |

Not all metrics are available for every task. Check the task definition for the exact metrics used.
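To post-process results programmatically, you can walk the nested result payload shown in the examples above (tasks, then metrics, then scores). The sketch below assumes a result saved to a local JSON file; the file name and variable names are hypothetical.

import json

# Hypothetical path to a saved LM Harness result payload shaped like the examples above.
with open("lm_harness_results.json") as f:
    payload = json.load(f)

# Flatten tasks -> metrics -> scores into (task, metric, score, value) rows.
for task_name, task in payload["tasks"].items():
    for metric_name, metric in task["metrics"].items():
        for score_name, score in metric["scores"].items():
            print(f"{task_name}\t{metric_name}\t{score_name}\t{score['value']}")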