LM Harness Evaluations#
LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks.
Tip
Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.
Target Configuration
All LM Harness evaluations use the same target structure. Here’s an example targeting a NIM endpoint:
{
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "https://<nim-base-url>/v1/chat/completions",
"model_id": "meta/llama-3.3-70b-instruct"
}
}
}
}
| Field | Description | Required | Default |
|---|---|---|---|
| `type` | Always `model`. | Yes | — |
| `api_endpoint.url` | The URL of the API endpoint for the model. | Yes | — |
| `api_endpoint.model_id` | The model identifier. | Yes | — |
| `api_endpoint.stream` | Whether to use streaming responses. | No | `false` |
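If you need streaming responses, the optional stream flag sits alongside url and model_id. A minimal sketch, assuming the field placement listed in the table above:
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct",
        "stream": false
      }
    }
  }
}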
Example Job Execution#
You can execute an Evaluation Job using either the Python SDK or cURL, replacing <my-eval-config> with one of the configurations shown on this page. Examples are provided for both the v2 and v1 evaluation APIs:
Note
See Job Target and Configuration Matrix for details on target / config compatibility.
from nemo_microservices import NeMoMicroservices

# Base URL of the NIM deployment that serves the target model
NIM_BASE_URL = "https://<nim-base-url>"

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.v2.evaluation.jobs.create(
spec={
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
},
},
"config": <my-eval-config>
}
)
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"spec": {
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
"url": "https://<nim-base-url>/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}
}'
from nemo_microservices import NeMoMicroservices

# Base URL of the NIM deployment that serves the target model
NIM_BASE_URL = "https://<nim-base-url>"

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.evaluation.jobs.create(
namespace="my-organization",
target={
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
},
},
config=<my-eval-config>
)
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
"url": "https://<nim-base-url>/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}'
For a full example, see Run an Academic LM Harness Eval.
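For reference, this is what a complete v2 request body looks like once <my-eval-config> is replaced with an inline configuration (here, the GSM8K configuration shown later on this page); the target name, namespace, and endpoint URL are placeholders:
{
  "spec": {
    "target": {
      "type": "model",
      "name": "my-target-dataset-1",
      "namespace": "my-organization",
      "model": {
        "api_endpoint": {
          "url": "https://<nim-base-url>/v1/chat/completions",
          "model_id": "meta/llama-3.1-8b-instruct"
        }
      }
    },
    "config": {
      "type": "gsm8k",
      "params": {
        "temperature": 1.0,
        "top_p": 0.01,
        "max_tokens": 256,
        "parallelism": 10,
        "extra": {
          "model_type": "completions",
          "num_fewshot": 8,
          "hf_token": "hf_your_token_here",
          "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
          "tokenizer_backend": "huggingface"
        }
      }
    }
  }
}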
Supported Tasks#
| Category | Example Task(s) | Description |
|---|---|---|
| Advanced Reasoning | `gpqa_diamond_cot`, `bbh` | Big-Bench Hard, multistep reasoning, and graduate-level Q&A tasks. |
| Instruction Following | `ifeval` | Tests ability to follow specific instructions. |
| Language Understanding | `mmlu` | Massive Multitask Language Understanding; covers 57 subjects across STEM, humanities, and more. |
| Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. |
| Multilingual Tasks | `mgsm`, `wmt16` | Math word problems and translation tasks in multiple languages. |
For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run python -m lm_eval --tasks list.
Advanced Reasoning (GPQA)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for the GPQA Diamond chain-of-thought task.
{
"type": "gpqa_diamond_cot",
"params": {
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.01,
"extra": {
"model_type": "chat",
"hf_token": "hf_your_token_here",
}
}
}
{
"question": "What is the capital of France?",
"choices": ["Paris", "London", "Berlin", "Madrid"],
"answer": "Paris",
"output": "Paris"
}
{
"tasks": {
"gpqa_diamond_cot_zeroshot": {
"metrics": {
"exact_match__flexible-extract": {
"scores": {
"exact_match__flexible-extract": {
"value": 1.0
}
}
}
}
}
}
}
Instruction Following (IFEval)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for IFEval.
{
"type": "ifeval",
"params": {
"max_retries": 5,
"parallelism": 10,
"request_timeout": 300,
"limit_samples": 50,
"temperature": 1.0,
"top_p": 0.01,
"max_tokens": 1024,
"extra": {
"model_type": "chat",
"hf_token": "hf_your_token_here",
"tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
"tokenizer_backend": "huggingface"
}
}
}
{
"prompt": "Write a short story about a cat. The story must be exactly 3 sentences long.",
"instruction_id_list": ["length_constraints:number_sentences"],
"kwargs": [{"num_sentences": 3}],
"output": "The cat sat by the window. It watched the birds outside. Then it fell asleep in the warm sunlight."
}
{
"tasks": {
"ifeval": {
"metrics": {
"prompt_level_strict_acc": {
"scores": {
"prompt_level_strict_acc": {
"value": 1.0
}
}
}
}
}
}
}
Language Understanding (MMLU)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for MMLU.
{
"type": "mmlu",
"params": {
"extra": {
"model_type": "completions",
"num_fewshot": 5,
"hf_token": "hf_your_token_here",
"tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
"tokenizer_backend": "huggingface"
}
}
}
{
"question": "Which of the following is a prime number?",
"choices": ["4", "6", "7", "8"],
"answer": "7",
"output": "7"
}
{
"tasks": {
"mmlu_abstract_algebra": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
Math & Reasoning (GSM8K)#
The following shows an example configuration, a sample dataset record, and the corresponding results format for GSM8K.
{
"type": "gsm8k",
"params": {
"temperature": 1.0,
"top_p": 0.01,
"max_tokens": 256,
"parallelism": 10,
"extra": {
"model_type": "completions",
"num_fewshot": 8,
"hf_token": "hf_your_token_here",
"tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
"tokenizer_backend": "huggingface"
}
}
}
{
"question": "If you have 3 apples and you get 2 more, how many apples do you have?",
"answer": "5",
"output": "5"
}
{
"tasks": {
"gsm8k": {
"metrics": {
"accuracy": {
"scores": {
"accuracy": {
"value": 1.0
}
}
}
}
}
}
}
For the full list of LM Harness tasks, refer to the lm-evaluation-harness tasks directory.
Parameters#
Request Parameters#
These parameters control how requests are made to the model:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `max_retries` | Maximum number of retries for failed requests. | Integer | Container default | — |
| `parallelism` | Number of parallel requests to improve throughput. | Integer | Container default | — |
| `request_timeout` | Timeout in seconds for each request. | Integer | Container default | — |
| `limit_samples` | Limit the number of samples to evaluate. Useful for testing. | Integer | — | — |
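Request parameters sit directly under params, next to the task type. For example (values taken from the IFEval configuration above):
{
  "type": "ifeval",
  "params": {
    "max_retries": 5,
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 50
  }
}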
Model Parameters#
These parameters control the model’s generation behavior:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `temperature` | Sampling temperature for generation. | Float | Container default | `0.0` to `2.0` |
| `top_p` | Nucleus sampling parameter. | Float | 0.01 | `0.0` to `1.0` |
| `max_tokens` | Maximum number of tokens to generate. | Integer | Container default | — |
| `stop` | Up to 4 sequences where the API will stop generating further tokens. | Array of strings | — | — |
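Model parameters also go directly under params. None of the task examples on this page use stop; the sketch below assumes it is accepted alongside the other generation settings, with placeholder stop sequences:
{
  "type": "gsm8k",
  "params": {
    "temperature": 1.0,
    "top_p": 0.01,
    "max_tokens": 256,
    "stop": ["Question:", "\n\n"]
  }
}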
Extra Parameters#
Set these parameters in the params.extra section:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `model_type` | Type of model interface to use. Required for the underlying container, but Evaluator will attempt to guess if not provided. | String | Autodetected | `completions`, `chat` |
| `hf_token` | HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace. | String | — | Valid HF token |
| `tokenizer` | Path to the tokenizer model. If missing, will attempt to use the target model from HuggingFace. | String | Target model name | HuggingFace model path |
| `tokenizer_backend` | System for loading the tokenizer. | String | `huggingface` | `huggingface`, `tiktoken` |
| `num_fewshot` | Number of examples in few-shot context. | Integer | Task-dependent | — |
| `tokenized_requests` | Whether to use tokenized requests. | Boolean | `False` | `True`, `False` |
| `downsampling_ratio` | Ratio for downsampling the dataset. | Float | — | `0.0` to `1.0` |
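For a quick smoke test you can combine limit_samples with the downsampling and tokenization options above. A minimal sketch; the tokenized_requests and downsampling_ratio names follow the table above, so confirm them against your container version:
{
  "type": "mmlu",
  "params": {
    "limit_samples": 50,
    "extra": {
      "model_type": "completions",
      "num_fewshot": 5,
      "tokenized_requests": false,
      "downsampling_ratio": 0.1,
      "hf_token": "hf_your_token_here",
      "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
      "tokenizer_backend": "huggingface"
    }
  }
}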
Important Notes#
- `model_type`: Different tasks support different model types. Tasks that support both "completions" and "chat" default to "chat"; if no preference is detected, the default is "completions".
- `hf_token`: Required for tasks that fetch datasets or tokenizers from HuggingFace. If the token is missing or lacks sufficient permissions, errors appear in the run logs.
- `tokenizer`: Some tasks require a tokenizer. NVIDIA internal model names are often lowercase while HuggingFace models use different casing, which can cause failures if the tokenizer is not specified correctly.
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `accuracy` | Accuracy (fraction of correct predictions) | 0 to 1 | Most common for classification tasks |
| `acc_norm` | Length-normalized accuracy | 0 to 1 | Normalizes for answer length |
| `acc_mutual_info` | Baseline loglikelihood - normalized accuracy | Task-dependent | Used in some specialized tasks |
| `perplexity` | Perplexity (measure of model uncertainty) | ≥ 1 | Lower is better |
| `word_perplexity` | Perplexity per word | ≥ 1 | Lower is better |
| `byte_perplexity` | Perplexity per byte | ≥ 1 | Lower is better |
| `bits_per_byte` | Bits per byte | ≥ 0 | Lower is better |
| `mcc` | Matthews correlation coefficient | -1 to 1 | For binary/multiclass classification |
| `f1` | F1 score (harmonic mean of precision and recall) | 0 to 1 | For classification/QA tasks |
| `bleu` | BLEU score (text generation quality) | 0 to 100 | For translation/generation tasks |
| `chrf` | Character F-score (CHRF) | 0 to 100 | For translation/generation tasks |
| `ter` | Translation Edit Rate (TER) | ≥ 0 | For translation tasks |
| `prompt_level_strict_acc` | Prompt-level strict accuracy for instruction following | 0 to 1 | For instruction following tasks like IFEval |
| `pass@1` | Pass rate for code generation (first attempt) | 0 to 1 | For code generation tasks |
Not all metrics are available for every task. Check the task definition for the exact metrics used.