LM Harness Evaluation Type#

LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks.

Prerequisites#

Target Configuration#

Set up or select an existing evaluation target. All LM Harness evaluations use the same target structure. Here’s an example targeting a NIM endpoint:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

| Field | Description | Required | Default |
|---|---|---|---|
| type | Always "model" for LM Harness evaluations. | Yes | — |
| url | The URL of the API endpoint for the model. | Yes | — |
| model_id | The model identifier. | Yes | — |
| stream | Whether to use streaming responses. | No | false |
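
The optional stream field is not shown in the example above. Assuming it sits alongside url and model_id in api_endpoint, as the field table suggests, a target with streaming disabled explicitly might look like the following sketch (illustrative only):

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct",
        "stream": false
      }
    }
  }
}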


Supported Tasks#

Example LM Harness Tasks by Category#

| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | bbh, musr, gpqa_diamond_cot | Big-Bench Hard, multistep reasoning, and graduate-level Q&A tasks. |
| Instruction Following | ifeval | Tests the ability to follow specific instructions. |
| Language Understanding | mmlu | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | gsm8k | Grade school and advanced math word problems. |
| Multilingual Tasks | mgsm, wikilingua | Math word problems and translation tasks in multiple languages. |

For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run python -m lm_eval --tasks list.
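
Any of the task names above can be used as the configuration type. For instance, a hypothetical configuration for the bbh task, following the same pattern as the examples in the sections below (the name, namespace, and parameter values are illustrative, not recommended settings):

{
  "type": "bbh",
  "name": "my-configuration-lm-harness-bbh-1",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here"
    }
  }
}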


Advanced Reasoning (GPQA)#

Example configuration:

{
  "type": "gpqa_diamond_cot",
  "name": "my-configuration-lm-harness-gpqa-diamond-cot-1",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here",
    }
  }
}

Example data and model output:

{
  "question": "What is the capital of France?",
  "choices": ["Paris", "London", "Berlin", "Madrid"],
  "answer": "Paris",
  "output": "Paris"
}

Example results:

{
  "tasks": {
    "gpqa_diamond_cot_zeroshot": {
      "metrics": {
        "exact_match__flexible-extract": {
          "scores": {
            "exact_match__flexible-extract": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Instruction Following (IFEval)#

Example configuration:

{
  "type": "ifeval",
  "name": "my-configuration-lm-harness-ifeval-1",
  "namespace": "my-organization",
  "params": {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50,
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 1024,
    "extra": {
      "model_type": "chat",
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}

Example data and model output:

{
  "prompt": "Write a short story about a cat. The story must be exactly 3 sentences long.",
  "instruction_id_list": ["length_constraints:number_sentences"],
  "kwargs": [{"num_sentences": 3}],
  "output": "The cat sat by the window. It watched the birds outside. Then it fell asleep in the warm sunlight."
}

Example results:

{
  "tasks": {
    "ifeval": {
      "metrics": {
        "prompt_level_strict_acc": {
          "scores": {
            "prompt_level_strict_acc": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Language Understanding (MMLU)#

Example configuration:

{
  "type": "mmlu",
  "name": "my-configuration-lm-harness-mmlu-1",
  "namespace": "my-organization",
  "params": {
    "extra": {
      "model_type": "completions",
      "num_fewshot": 5,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}

Example data and model output:

{
  "question": "Which of the following is a prime number?",
  "choices": ["4", "6", "7", "8"],
  "answer": "7",
  "output": "7"
}

Example results:

{
  "tasks": {
    "mmlu_abstract_algebra": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Math & Reasoning (GSM8K)#

Example configuration:

{
  "type": "gsm8k",
  "name": "my-configuration-lm-harness-gsm8k-1",
  "namespace": "my-organization",
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 256,
    "parallelism": 10,
    "extra": {
      "model_type": "completions",
      "num_fewshot": 8,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}

Example data and model output:

{
  "question": "If you have 3 apples and you get 2 more, how many apples do you have?",
  "answer": "5",
  "output": "5"
}

Example results:

{
  "tasks": {
    "gsm8k": {
      "metrics": {
        "accuracy": {
          "scores": {
            "accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}



Parameters#

Request Parameters#

These parameters control how requests are made to the model:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| max_retries | Maximum number of retries for failed requests. | Integer | Container default | — |
| parallelism | Number of parallel requests to improve throughput. | Integer | Container default | — |
| request_timeout | Timeout in seconds for each request. | Integer | Container default | — |
| limit_samples | Limit the number of samples to evaluate. Useful for testing. | Integer | null (all samples) | — |
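
As a minimal sketch, the request parameters above go directly under params in a configuration, as in the IFEval example earlier on this page; the values here are arbitrary and chosen only for illustration:

{
  "params": {
    "max_retries": 3,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50
  }
}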

Model Parameters#

These parameters control the model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| temperature | Sampling temperature for generation. | Float | Container default | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.01 | 0.01–1.0 |
| max_tokens | Maximum number of tokens to generate. | Integer | Container default | — |
| stop | Up to 4 sequences where the API will stop generating further tokens. | Array of strings | — | — |
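
The stop parameter does not appear in the task examples on this page. A hedged sketch of a params block that sets all four model parameters, assuming stop is passed as a JSON array of strings alongside the other generation settings (values are illustrative):

{
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 512,
    "stop": ["</s>"]
  }
}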

Extra Parameters#

Set these parameters in the params.extra section:

| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| model_type | Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess if not provided. | String | Autodetected | "completions", "chat", "vlm" |
| hf_token | HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace. | String | — | Valid HF token |
| tokenizer | Path to the tokenizer model. If missing, will attempt to use the target model from HuggingFace. | String | Target model name | HuggingFace model path |
| tokenizer_backend | System for loading the tokenizer. | String | "huggingface" | "huggingface", "tiktoken" |
| num_fewshot | Number of examples in few-shot context. | Integer | Task-dependent | — |
| tokenized_requests | Whether to use tokenized requests. | Boolean | false | true, false |
| downsampling_ratio | Ratio for downsampling the dataset. | Float | null (no downsampling) | 0.0–1.0 |
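
The tokenized_requests and downsampling_ratio fields likewise do not appear in the task examples above. An illustrative params.extra fragment that sets them (the downsampling value is arbitrary):

{
  "params": {
    "extra": {
      "model_type": "completions",
      "hf_token": "hf_your_token_here",
      "tokenized_requests": false,
      "downsampling_ratio": 0.1
    }
  }
}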

Important Notes#

  • model_type: Different tasks support different model types. Tasks that support both "completions" and "chat" default to "chat"; if no preference is detected, the default is "completions".

  • hf_token: Required for tasks that fetch datasets or tokenizers from HuggingFace. If the token is missing or lacks the necessary permissions, errors appear in the run logs.

  • tokenizer: Some tasks require a tokenizer. NVIDIA internal model names are often lowercase, while the corresponding HuggingFace models use different casing; specify the tokenizer explicitly (as in the MMLU and GSM8K configurations above) to avoid lookup failures.

Metrics#

Core Supported Metrics in LM Evaluation Harness#

| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| acc | Accuracy (fraction of correct predictions) | 0–1 | Most common for classification tasks |
| acc_norm | Length-normalized accuracy | 0–1 | Normalizes for answer length |
| acc_mutual_info | Baseline log-likelihood normalized accuracy | Task-dependent | Used in some specialized tasks |
| perplexity | Perplexity (measure of model uncertainty) | >0 | Lower is better |
| word_perplexity | Perplexity per word | >0 | Lower is better |
| byte_perplexity | Perplexity per byte | >0 | Lower is better |
| bits_per_byte | Bits per byte | >0 | Lower is better |
| matthews_corrcoef | Matthews correlation coefficient | -1 to 1 | For binary/multiclass classification |
| f1 | F1 score (harmonic mean of precision and recall) | 0–1 | For classification/QA tasks |
| bleu | BLEU score (text generation quality) | 0–100 | For translation/generation tasks |
| chrf | Character F-score (CHRF) | 0–100 | For translation/generation tasks |
| ter | Translation Edit Rate (TER) | 0–1 | For translation tasks |
| prompt_level_strict_acc | Prompt-level strict accuracy for instruction following | 0–1 | For instruction following tasks like IFEval |
| pass@1 | Pass rate for code generation (first attempt) | 0–1 | For code generation tasks |

Not all metrics are available for every task. Check the task definition for the exact metrics used.
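
Metric names surface in results under the same nesting shown in the task examples above. For instance, a purely illustrative results fragment for a hypothetical run reporting acc_norm (the task name and value are placeholders, not real output):

{
  "tasks": {
    "example_task": {
      "metrics": {
        "acc_norm": {
          "scores": {
            "acc_norm": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}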