LM Harness Evaluation Type#
LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks.
Prerequisites#
Target Configuration#
Set up or select an existing evaluation target. All LM Harness evaluations use the same target structure. Here’s an example targeting a NIM endpoint:
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}
| Field | Description | Required | Default |
|---|---|---|---|
| `type` | The target type. Always `model`. | Yes | — |
| `api_endpoint.url` | The URL of the API endpoint for the model. | Yes | — |
| `api_endpoint.model_id` | The model identifier. | Yes | — |
| `api_endpoint.stream` | Whether to use streaming responses. | No | `false` |
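If you manage targets through the Evaluator API, you can register a target like this with a single POST request. The sketch below is illustrative only: it assumes an Evaluator deployment reachable at `EVALUATOR_BASE_URL` and a `/v1/evaluation/targets` route, and the `name` value is a placeholder; check your deployment's API reference for the exact path, payload schema, and authentication.

```python
import os

import requests

# Illustrative sketch: register the NIM target shown above with the Evaluator API.
# The base URL, route, and extra fields ("name", "namespace") are assumptions;
# adjust them to match your deployment's API reference.
EVALUATOR_BASE_URL = os.environ["EVALUATOR_BASE_URL"]

target = {
    "type": "model",
    "name": "my-nim-target",          # hypothetical target name for this example
    "namespace": "my-organization",
    "model": {
        "api_endpoint": {
            "url": "https://nim.int.aire.nvidia.com/chat/completions",
            "model_id": "meta/llama-3.3-70b-instruct",
        }
    },
}

response = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets", json=target)
response.raise_for_status()
print(response.json())
```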
Supported Tasks#
| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | `bbh`, `gpqa_diamond_cot`, `musr` | Big-Bench Hard, multistep reasoning, and graduate-level Q&A tasks. |
| Instruction Following | `ifeval` | Tests ability to follow specific instructions. |
| Language Understanding | `mmlu` | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. |
| Multilingual Tasks | `mgsm` | Math word problems and translation tasks in multiple languages. |
For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run `python -m lm_eval --tasks list`.
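To search that list programmatically, you can wrap the same command. This is a small convenience sketch (the `mmlu` keyword is just an example) and not part of the Evaluator API; it requires lm-evaluation-harness to be installed in the current Python environment.

```python
import subprocess

# Dump the LM Harness task list via the CLI mentioned above and keep only
# lines that mention a keyword of interest.
result = subprocess.run(
    ["python", "-m", "lm_eval", "--tasks", "list"],
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.splitlines():
    if "mmlu" in line.lower():
        print(line)
```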
Advanced Reasoning (GPQA)#
Example configuration:

{
  "type": "gpqa_diamond_cot",
  "name": "my-configuration-lm-harness-gpqa-diamond-cot-1",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here"
    }
  }
}
Example data:

{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}
Example results:

{
  "tasks": {
    "gpqa_diamond_cot_zeroshot": {
      "metrics": {
        "exact_match__flexible-extract": {
          "scores": {
            "exact_match__flexible-extract": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
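To run this configuration against a registered target, submit an evaluation job. The sketch below assumes the same hypothetical Evaluator deployment as earlier, a `/v1/evaluation/jobs` route, and a `namespace/name` reference format for the config and target; treat it as a starting point and confirm the payload shape against your deployment's API reference.

```python
import os

import requests

# Illustrative sketch: submit a GPQA evaluation job pairing the configuration
# above with a previously registered target. Route name and reference format
# are assumptions, not a documented contract.
EVALUATOR_BASE_URL = os.environ["EVALUATOR_BASE_URL"]

job = {
    "namespace": "my-organization",
    "target": "my-organization/my-nim-target",  # hypothetical target reference
    "config": "my-organization/my-configuration-lm-harness-gpqa-diamond-cot-1",
}

response = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs", json=job)
response.raise_for_status()
print(response.json())  # typically includes a job identifier to poll for results
```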
Instruction Following (IFEval)#
Example configuration:

{
  "type": "ifeval",
  "name": "my-configuration-lm-harness-ifeval-1",
  "namespace": "my-organization",
  "params": {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50,
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 1024,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}
Example data:

{
  "prompt": "Write a short story about a cat. The story must be exactly 3 sentences long.",
  "instruction_id_list": ["length_constraints:number_sentences"],
  "kwargs": [{"num_sentences": 3}],
  "output": "The cat sat by the window. It watched the birds outside. Then it fell asleep in the warm sunlight."
}
Example results:

{
  "tasks": {
    "ifeval": {
      "metrics": {
        "prompt_level_strict_acc": {
          "scores": {
            "prompt_level_strict_acc": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
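As a rough illustration of what the strict prompt-level metric verifies, the `length_constraints:number_sentences` instruction in the sample above is satisfied because the output contains exactly three sentences. The snippet below is a simplified check written for this one sample; the actual IFEval verifier in lm-evaluation-harness handles many more instruction types and edge cases.

```python
# Simplified illustration of the "length_constraints:number_sentences" check
# from the sample above; not the real IFEval verifier.
output = (
    "The cat sat by the window. It watched the birds outside. "
    "Then it fell asleep in the warm sunlight."
)
sentences = [s for s in output.split(".") if s.strip()]
print(len(sentences) == 3)  # True: matches kwargs {"num_sentences": 3}
```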
Language Understanding (MMLU)#
Example configuration:

{
  "type": "mmlu",
  "name": "my-configuration-lm-harness-mmlu-1",
  "namespace": "my-organization",
  "params": {
    "extra": {
      "model_type": "completions",
      "num_fewshot": 5,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}
Example data:

{
  "question": "Which of the following is a prime number?",
  "choices": ["4", "6", "7", "8"],
  "answer": "7",
  "output": "7"
}
Example results:

{
  "tasks": {
    "mmlu_abstract_algebra": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
Math & Reasoning (GSM8K)#
Example configuration:

{
  "type": "gsm8k",
  "name": "my-configuration-lm-harness-gsm8k-1",
  "namespace": "my-organization",
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 256,
    "parallelism": 10,
    "extra": {
      "model_type": "completions",
      "num_fewshot": 8,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}
Example data:

{
  "question": "If you have 3 apples and you get 2 more, how many apples do you have?",
  "answer": "5",
  "output": "5"
}
Example results:

{
  "tasks": {
    "gsm8k": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
For the full list of LM Harness tasks, refer to the lm-evaluation-harness tasks directory.
Parameters#
Request Parameters#
These parameters control how requests are made to the model:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `max_retries` | Maximum number of retries for failed requests. | Integer | Container default | — |
| `parallelism` | Number of parallel requests to improve throughput. | Integer | Container default | — |
| `request_timeout` | Timeout in seconds for each request. | Integer | Container default | — |
| `limit_samples` | Limit the number of samples to evaluate. Useful for testing. | Integer | All samples | — |
Model Parameters#
These parameters control the model’s generation behavior:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `temperature` | Sampling temperature for generation. | Float | Container default | — |
| `top_p` | Nucleus sampling parameter. | Float | 0.01 | 0.0–1.0 |
| `max_tokens` | Maximum number of tokens to generate. | Integer | Container default | — |
| `stop` | Up to 4 sequences where the API will stop generating further tokens. | Array of strings | — | — |
Extra Parameters#
Set these parameters in the `params.extra` section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `model_type` | Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess it if not provided. | String | Autodetected | `chat`, `completions` |
| `hf_token` | HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace. | String | — | Valid HF token |
| `tokenizer` | Path to the tokenizer model. If missing, the target model name is used to fetch a tokenizer from HuggingFace. | String | Target model name | HuggingFace model path |
| `tokenizer_backend` | System for loading the tokenizer. | String | `huggingface` | `huggingface`, `tiktoken` |
| `num_fewshot` | Number of examples in the few-shot context. | Integer | Task-dependent | — |
| `tokenized_requests` | Whether to use tokenized requests. | Boolean | `false` | `true`, `false` |
| `downsampling_ratio` | Ratio for downsampling the dataset. | Float | — | 0.0–1.0 |
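Putting the three groups together, a full `params` block mirrors the configuration examples above: request and model parameters at the top level, container-specific options under `extra`. A sketch of assembling one in Python follows; the specific values are placeholders, not recommendations.

```python
# Assemble a params block matching the structure used in the examples above.
# The values are placeholders; tune them for your task and deployment.
params = {
    # Request parameters
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50,        # a small sample count is handy for smoke tests
    # Model parameters
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 1024,
    # Extra (container-specific) parameters
    "extra": {
        "model_type": "chat",
        "hf_token": "hf_your_token_here",
        "tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
        "tokenizer_backend": "huggingface",
    },
}
```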
Important Notes#
- `model_type`: Different tasks support different model types. Tasks that support both `completions` and `chat` default to `chat`; if no preference is detected, the default is `completions`.
- `hf_token`: Required for tasks that fetch datasets or tokenizers from HuggingFace. If the token is missing or lacks the required permissions, errors appear in the run logs.
- `tokenizer`: Some tasks require a tokenizer. NVIDIA internal model names are often lowercase while the corresponding HuggingFace models use different casing, which can cause failures if the tokenizer is not specified correctly.
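A lightweight pre-flight check can catch the most common of these problems before a job is submitted. The helper below is a hypothetical local convenience, not part of the Evaluator; it only inspects the `extra` dictionary you built yourself.

```python
def check_extra_params(extra: dict) -> list[str]:
    """Return warnings for common LM Harness `extra` misconfigurations.

    Hypothetical local sanity check; it does not replace the validation
    performed by the Evaluator or the evaluation container.
    """
    warnings = []
    model_type = extra.get("model_type")
    if model_type not in (None, "chat", "completions"):
        warnings.append(f"unexpected model_type: {model_type!r}")
    if not extra.get("hf_token"):
        warnings.append("hf_token is missing; tasks that pull datasets or "
                        "tokenizers from HuggingFace will fail")
    if model_type == "completions" and not extra.get("tokenizer"):
        warnings.append("completions-style tasks usually need an explicit "
                        "tokenizer; double-check its HuggingFace casing")
    return warnings


# Prints two warnings: missing hf_token and missing tokenizer.
print(check_extra_params({"model_type": "completions", "hf_token": ""}))
```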
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `acc` | Accuracy (fraction of correct predictions) | 0.0–1.0 | Most common for classification tasks |
| `acc_norm` | Length-normalized accuracy | 0.0–1.0 | Normalizes for answer length |
| `acc_mutual_info` | Baseline loglikelihood-normalized accuracy | Task-dependent | Used in some specialized tasks |
| `perplexity` | Perplexity (measure of model uncertainty) | ≥ 1.0 | Lower is better |
| `word_perplexity` | Perplexity per word | ≥ 1.0 | Lower is better |
| `byte_perplexity` | Perplexity per byte | ≥ 1.0 | Lower is better |
| `bits_per_byte` | Bits per byte | ≥ 0.0 | Lower is better |
| `mcc` | Matthews correlation coefficient | -1.0 to 1.0 | For binary/multiclass classification |
| `f1` | F1 score (harmonic mean of precision and recall) | 0.0–1.0 | For classification/QA tasks |
| `bleu` | BLEU score (text generation quality) | 0–100 | For translation/generation tasks |
| `chrf` | Character F-score (CHRF) | 0–100 | For translation/generation tasks |
| `ter` | Translation Edit Rate (TER) | ≥ 0 | For translation tasks; lower is better |
| `prompt_level_strict_acc` | Prompt-level strict accuracy for instruction following | 0.0–1.0 | For instruction-following tasks like IFEval |
| `pass@1` | Pass rate for code generation (first attempt) | 0.0–1.0 | For code generation tasks |
Not all metrics are available for every task. Check the task definition for the exact metrics used.
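Because every LM Harness result follows the same nested `tasks → metrics → scores` shape shown in the examples above, a small nested loop is enough to pull out flat `(task, metric, value)` rows regardless of which metrics a task reports. A sketch, assuming you have already fetched the results JSON into a Python dictionary:

```python
def flatten_results(results: dict) -> list[tuple[str, str, float]]:
    """Flatten an LM Harness results payload into (task, score, value) rows."""
    rows = []
    for task_name, task in results.get("tasks", {}).items():
        for metric in task.get("metrics", {}).values():
            for score_name, score in metric.get("scores", {}).items():
                rows.append((task_name, score_name, score["value"]))
    return rows


# Example: the GSM8K result snippet shown earlier.
results = {
    "tasks": {
        "gsm8k": {
            "metrics": {
                "accuracy": {"scores": {"accuracy": {"value": 1.0}}}
            }
        }
    }
}
print(flatten_results(results))  # [('gsm8k', 'accuracy', 1.0)]
```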