LM Harness Evaluations#
LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks.
Tip
Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.
Target Configuration
All LM Harness evaluations use the same target structure. Here’s an example targeting a NIM endpoint:
{
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "https://<nim-base-url>/v1/chat/completions",
"model_id": "meta/llama-3.3-70b-instruct"
}
}
}
}
| Field | Description | Required | Default |
|---|---|---|---|
| `type` | Always `model`. | Yes | — |
| `api_endpoint.url` | The URL of the API endpoint for the model. | Yes | — |
| `api_endpoint.model_id` | The model identifier. | Yes | — |
| `api_endpoint.stream` | Whether to use streaming responses. | No | `false` |
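If you need streaming responses, the optional stream flag sits alongside url and model_id. A minimal sketch, assuming the field placement listed in the table above:
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct",
        "stream": false
      }
    }
  }
}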
Example Job Execution#
You can execute an Evaluation Job using either the Python SDK or cURL, replacing <my-eval-config> with one of the configurations shown on this page. Examples are provided for both the v2 and v1 evaluation APIs:
Note
See Job Target and Configuration Matrix for details on target / config compatibility.
from nemo_microservices import NeMoMicroservices

# Base URL of the NIM deployment that serves the target model
NIM_BASE_URL = "https://<nim-base-url>"

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.v2.evaluation.jobs.create(
spec={
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
},
},
"config": <my-eval-config>
}
)
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"spec": {
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
"url": "https://<nim-base-url>/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}
}'
from nemo_microservices import NeMoMicroservices

# Base URL of the NIM deployment that serves the target model
NIM_BASE_URL = "https://<nim-base-url>"

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.evaluation.jobs.create(
namespace="my-organization",
target={
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
},
},
config=<my-eval-config>
)
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
"url": "https://<nim-base-url>/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}'
For a full example, see Run an Academic LM Harness Eval.
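For reference, this is what a complete v2 request body looks like once <my-eval-config> is replaced with an inline configuration (here, the GSM8K configuration shown later on this page); the target name, namespace, and endpoint URL are placeholders:
{
  "spec": {
    "target": {
      "type": "model",
      "name": "my-target-dataset-1",
      "namespace": "my-organization",
      "model": {
        "api_endpoint": {
          "url": "https://<nim-base-url>/v1/chat/completions",
          "model_id": "meta/llama-3.1-8b-instruct"
        }
      }
    },
    "config": {
      "type": "gsm8k",
      "params": {
        "temperature": 1.0,
        "top_p": 0.01,
        "max_tokens": 256,
        "parallelism": 10,
        "extra": {
          "model_type": "completions",
          "num_fewshot": 8,
          "hf_token": "hf_your_token_here",
          "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
          "tokenizer_backend": "huggingface"
        }
      }
    }
  }
}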
Supported Tasks#
| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | `gpqa_diamond_cot`, `bbh` | Big-Bench Hard, multistep reasoning, and graduate-level Q&A tasks. |
| Instruction Following | `ifeval` | Tests ability to follow specific instructions. |
| Language Understanding | `mmlu` | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. |
| Multilingual Tasks | `mgsm`, `wmt16` | Math word problems and translation tasks in multiple languages. |
For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run python -m lm_eval --tasks list.
Advanced Reasoning (GPQA)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for the GPQA Diamond chain-of-thought task.
{
"type": "gpqa_diamond_cot",
"params": {
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.01,
"extra": {
"model_type": "chat",
"hf_token": "hf_your_token_here",
}
}
}
{
"question": "What is the capital of France?",
"choices": ["Paris", "London", "Berlin", "Madrid"],
"answer": "Paris",
"output": "Paris"
}
{
"tasks": {
"gpqa_diamond_cot_zeroshot": {
"metrics": {
"exact_match__flexible-extract": {
"scores": {
"exact_match__flexible-extract": {
"value": 1.0
}
}
}
}
}
}
}
Instruction Following (IFEval)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for IFEval.
{
"type": "ifeval",
"params": {
"max_retries": 5,
"parallelism": 10,
"request_timeout": 300,
"limit_samples": 50,
"temperature": 1.0,
"top_p": 0.01,
"max_tokens": 1024,
"extra": {
"model_type": "chat",
"hf_token": "hf_your_token_here",
"tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
"tokenizer_backend": "huggingface"
}
}
}
{
"prompt": "Write a short story about a cat. The story must be exactly 3 sentences long.",
"instruction_id_list": ["length_constraints:number_sentences"],
"kwargs": [{"num_sentences": 3}],
"output": "The cat sat by the window. It watched the birds outside. Then it fell asleep in the warm sunlight."
}
{
"tasks": {
"ifeval": {
"metrics": {
"prompt_level_strict_acc": {
"scores": {
"prompt_level_strict_acc": {
"value": 1.0
}
}
}
}
}
}
}
Language Understanding (MMLU)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for MMLU.
{
"type": "mmlu",
"params": {
"extra": {
"model_type": "completions",
"num_fewshot": 5,
"hf_token": "hf_your_token_here",
"tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
"tokenizer_backend": "huggingface"
}
}
}
{
"question": "Which of the following is a prime number?",
"choices": ["4", "6", "7", "8"],
"answer": "7",
"output": "7"
}
{
"tasks": {
"mmlu_abstract_algebra": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
Math & Reasoning (GSM8K)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for GSM8K.
{
"type": "gsm8k",
"params": {
"temperature": 1.0,
"top_p": 0.01,
"max_tokens": 256,
"parallelism": 10,
"extra": {
"model_type": "completions",
"num_fewshot": 8,
"hf_token": "hf_your_token_here",
"tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
"tokenizer_backend": "huggingface"
}
}
}
{
"question": "If you have 3 apples and you get 2 more, how many apples do you have?",
"answer": "5",
"output": "5"
}
{
"tasks": {
"gsm8k": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
For the full list of LM Harness tasks, refer to the lm-evaluation-harness tasks directory.
Parameters#
Request Parameters#
These parameters control how requests are made to the model:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `max_retries` | Maximum number of retries for failed requests. | Integer | Container default | — |
| `parallelism` | Number of parallel requests to improve throughput. | Integer | Container default | — |
| `request_timeout` | Timeout in seconds for each request. | Integer | Container default | — |
| `limit_samples` | Limit the number of samples to evaluate. Useful for testing. | Integer | — | — |
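Request parameters sit directly under params, next to the task type. For example (values taken from the IFEval configuration above):
{
  "type": "ifeval",
  "params": {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50
  }
}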
Model Parameters#
These parameters control the model’s generation behavior:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `temperature` | Sampling temperature for generation. | Float | Container default | `0.0` to `2.0` |
| `top_p` | Nucleus sampling parameter. | Float | 0.01 | `0.0` to `1.0` |
| `max_tokens` | Maximum number of tokens to generate. | Integer | Container default | — |
| `stop` | Up to 4 sequences where the API will stop generating further tokens. | Array of strings | — | — |
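Model parameters also go directly under params. None of the task examples on this page use stop; the sketch below assumes it is accepted alongside the other generation settings, with placeholder stop sequences:
{
  "type": "gsm8k",
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 256,
    "stop": ["Question:", "\n\n"]
  }
}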
Extra Parameters#
Set these parameters in the params.extra section:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `model_type` | Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess if not provided. | String | Autodetected | `completions`, `chat` |
| `hf_token` | HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace. | String | — | Valid HF token |
| `tokenizer` | Path to the tokenizer model. If missing, will attempt to use the target model from HuggingFace. | String | Target model name | HuggingFace model path |
| `tokenizer_backend` | System for loading the tokenizer. | String | `huggingface` | `huggingface`, `tiktoken` |
| `num_fewshot` | Number of examples in few-shot context. | Integer | Task-dependent | — |
| `tokenized_requests` | Whether to use tokenized requests. | Boolean | `False` | `True`, `False` |
| `downsampling_ratio` | Ratio for downsampling the dataset. | Float | — | `0.0` to `1.0` |
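For a quick smoke test you can combine limit_samples with the downsampling and tokenization options above. A minimal sketch; the tokenized_requests and downsampling_ratio names follow the table above, so confirm them against your container version:
{
  "type": "mmlu",
  "params": {
    "limit_samples": 50,
    "extra": {
      "model_type": "completions",
      "num_fewshot": 5,
      "tokenized_requests": false,
      "downsampling_ratio": 0.1,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}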
Important Notes#
- `model_type`: Different tasks support different model types. Tasks that support both "completions" and "chat" default to "chat"; if no preference is detected, the default is "completions".
- `hf_token`: Required for tasks that fetch datasets or tokenizers from HuggingFace. If the token is missing or lacks sufficient permissions, errors appear in the run logs.
- `tokenizer`: Some tasks require a tokenizer. NVIDIA internal model names are often lowercase while HuggingFace models use different casing, which can cause failures if the tokenizer is not specified correctly.
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `accuracy` | Accuracy (fraction of correct predictions) | 0 to 1 | Most common for classification tasks |
| `acc_norm` | Length-normalized accuracy | 0 to 1 | Normalizes for answer length |
| `acc_mutual_info` | Baseline loglikelihood - normalized accuracy | Task-dependent | Used in some specialized tasks |
| `perplexity` | Perplexity (measure of model uncertainty) | ≥ 1 | Lower is better |
| `word_perplexity` | Perplexity per word | ≥ 1 | Lower is better |
| `byte_perplexity` | Perplexity per byte | ≥ 1 | Lower is better |
| `bits_per_byte` | Bits per byte | ≥ 0 | Lower is better |
| `mcc` | Matthews correlation coefficient | -1 to 1 | For binary/multiclass classification |
| `f1` | F1 score (harmonic mean of precision and recall) | 0 to 1 | For classification/QA tasks |
| `bleu` | BLEU score (text generation quality) | 0 to 100 | For translation/generation tasks |
| `chrf` | Character F-score (CHRF) | 0 to 100 | For translation/generation tasks |
| `ter` | Translation Edit Rate (TER) | ≥ 0 | For translation tasks |
| `prompt_level_strict_acc` | Prompt-level strict accuracy for instruction following | 0 to 1 | For instruction following tasks like IFEval |
| `pass@1` | Pass rate for code generation (first attempt) | 0 to 1 | For code generation tasks |
Not all metrics are available for every task. Check the task definition for the exact metrics used.