Text Generation Evaluation#

Text generation evaluation is the primary method for assessing LLM capabilities: models produce natural language responses to prompts, and the generated text is assessed for quality, accuracy, and appropriateness across a wide range of tasks and domains.

Before You Start#

Ensure you have:

  1. Model Endpoint: An OpenAI-compatible API endpoint for your model (completions or chat)

  2. API Access: Valid API key if your endpoint requires authentication

  3. Installed Packages: NeMo Evaluator or access to evaluation containers

  4. Sufficient Resources: Adequate compute for your chosen benchmarks

Pre-Flight Check#

Verify your setup before running full evaluation:

import os

import requests


def check_endpoint(endpoint_url: str, model_id: str, api_key: str) -> bool:
    """Send a minimal chat request to verify the endpoint is reachable."""
    try:
        response = requests.post(
            endpoint_url,
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model_id,
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 10,
            },
            timeout=10,
        )
        assert response.status_code == 200, (
            f"Endpoint returned status {response.status_code}"
        )
        print("✓ Endpoint ready for evaluation")
        return True
    except Exception as e:
        print(f"✗ Endpoint check failed: {e}")
        print("Ensure your API key is valid and the endpoint is accessible")
        return False


if __name__ == "__main__":
    check_endpoint(
        endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key=os.environ["MY_API_KEY"],
    )

Tip

Run this script directly: python docs/evaluation/_snippets/prerequisites/endpoint_check.py


Evaluation Approach#

In text generation evaluation, four steps repeat for every sample (the toy sketch at the end of this section makes them concrete):

  1. Prompt Construction: Models receive carefully crafted prompts (questions, instructions, or text to continue)

  2. Response Generation: Models generate natural language responses using their trained parameters

  3. Response Assessment: Generated text is evaluated for correctness, quality, or adherence to specific criteria

  4. Metric Calculation: Numerical scores are computed based on evaluation criteria

This differs from log-probability evaluation where models assign confidence scores to predefined choices. For log-probability methods, see the Log-Probability Evaluation guide.
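
To make the four steps concrete, here is a deliberately simplified, self-contained sketch. The scoring is plain substring matching rather than any benchmark's real metric, and the endpoint URL, model ID, and MY_API_KEY environment variable are the same placeholders used throughout this guide; in practice these steps are handled for you by the evaluation frameworks.

import os

import requests

endpoint_url = "https://integrate.api.nvidia.com/v1/chat/completions"
model_id = "meta/llama-3.1-8b-instruct"
api_key = os.environ["MY_API_KEY"]

# Step 1: prompt construction (a tiny hand-written sample set, not a real benchmark)
samples = [
    {"prompt": "Answer with a single number: 2 + 2 =", "reference": "4"},
    {"prompt": "Answer with one word: the capital of France is", "reference": "Paris"},
]

correct = 0
for sample in samples:
    # Step 2: response generation via an OpenAI-compatible chat endpoint
    response = requests.post(
        endpoint_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model_id,
            "messages": [{"role": "user", "content": sample["prompt"]}],
            "max_tokens": 32,
            "temperature": 0.01,
        },
        timeout=30,
    )
    response.raise_for_status()
    generated = response.json()["choices"][0]["message"]["content"]

    # Step 3: response assessment (here: naive substring match against the reference)
    correct += int(sample["reference"].lower() in generated.lower())

# Step 4: metric calculation (accuracy over the sample set)
print(f"Toy accuracy: {correct / len(samples):.2%}")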

Choose Your Approach#

Recommended: The fastest way to run text generation evaluations is with the unified launcher CLI:

# List available text generation tasks
nemo-evaluator-launcher ls tasks

# Run MMLU Pro evaluation
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mmlu_pro"]' \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.api_key=${YOUR_API_KEY}

# Run multiple text generation benchmarks
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_text_generation_suite \
    -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]'

For programmatic evaluation in custom workflows:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure text generation evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=None,  # Full dataset
        temperature=0.01,    # Near-deterministic for reproducibility
        max_new_tokens=512,
        top_p=0.95
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct", 
        type=EndpointType.CHAT,
        api_key="MY_API_KEY"  # Environment variable name containing your API key
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")

For specialized container workflows:

# Pull and run text generation container
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash

# Inside container - set environment
export MY_API_KEY=your_api_key_here

# Run evaluation
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /tmp/results \
    --overrides 'config.params.limit_samples=100'

Discovering Available Tasks#

Use the launcher CLI to discover all available text generation tasks:

# List all available benchmarks
nemo-evaluator-launcher ls tasks

# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls tasks --json

# Filter for specific task types (example: academic benchmarks)
nemo-evaluator-launcher ls tasks | grep -E "(mmlu|gsm8k|arc)"

Run these commands to discover the complete list of available benchmarks across all installed frameworks.

Text Generation Task Categories#

Text Generation Benchmarks Overview#

| Area | Purpose | Example Tasks | Evaluation Method |
|---|---|---|---|
| Academic Benchmarks | Assess general knowledge and reasoning across academic domains | mmlu, mmlu_pro, arc_challenge, hellaswag, truthfulqa | Multiple-choice or short-answer text generation |
| Instruction Following | Evaluate ability to follow complex instructions and formatting requirements | ifeval, gpqa_diamond | Generated responses assessed against instruction criteria |
| Mathematical Reasoning | Test mathematical problem-solving and multi-step reasoning | gsm8k, math | Final answer extraction and numerical comparison |
| Multilingual Evaluation | Assess capabilities across different languages | mgsm (multilingual GSM8K) | Language-specific text generation and assessment |

Note

Task availability depends on installed frameworks. Use nemo-evaluator-launcher ls tasks to see the complete list for your environment.

Task Naming and Framework Specification#

Use simple task names when only one framework provides the task:

# Unambiguous task names
config = EvaluationConfig(type="mmlu")
config = EvaluationConfig(type="gsm8k")
config = EvaluationConfig(type="arc_challenge")

These tasks have unique names across all evaluation frameworks, so no qualification is needed.

When multiple frameworks provide the same task, specify the framework explicitly:

# Explicit framework specification
config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
config = EvaluationConfig(type="simple-evals.mmlu")

Use this approach when:

  • Multiple frameworks implement the same benchmark

  • You need specific framework behavior or scoring

  • Avoiding ambiguity in task resolution

Resolve task naming conflicts by listing available tasks:

from nemo_evaluator import show_available_tasks

# Display all tasks organized by framework
print("Available tasks by framework:")
show_available_tasks()

Or use the launcher CLI from the command line:

# List all tasks with framework information
nemo-evaluator-launcher ls tasks

# Filter for specific tasks
nemo-evaluator-launcher ls tasks | grep mmlu

This helps you:

  • Identify which framework implements a task

  • Resolve naming conflicts programmatically

  • Understand available task sources

Evaluation Configuration#

Basic Configuration Structure#

Text generation evaluations use the NVIDIA Eval Commons framework:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)

# Configure target endpoint
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="megatron_model"
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation parameters
params = ConfigParams(
    temperature=0.01,   # Near-deterministic generation
    top_p=1.0,          # No nucleus sampling
    limit_samples=100,  # Evaluate subset for testing
    parallelism=1       # Single-threaded requests
)

# Configure evaluation task
config = EvaluationConfig(
    type="mmlu",
    params=params,
    output_dir="./evaluation_results"
)

# Execute evaluation
results = evaluate(target_cfg=target, eval_cfg=config)

Endpoint Types#

Completions Endpoint (/v1/completions/):

  • Direct text completion without conversation formatting

  • Used for: Academic benchmarks, reasoning tasks, base model evaluation

  • Model processes prompts as-is without applying chat templates

Chat Endpoint (/v1/chat/completions/):

  • Conversational interface with role-based message formatting

  • Used for: Instruction following, chat benchmarks, instruction-tuned models

  • Requires models with defined chat templates (see the configuration sketch below)
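
Both endpoint types are configured through the same ApiEndpoint dataclass; the sketch below shows the two variants side by side, reusing the endpoint URLs and model IDs that appear elsewhere in this guide.

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

# Completions endpoint: prompts are sent as-is, no chat template (typical for base models)
completions_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        type=EndpointType.COMPLETIONS,
        model_id="megatron_model",
    )
)

# Chat endpoint: role-based messages, requires a chat template (instruction-tuned models)
chat_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        model_id="meta/llama-3.1-8b-instruct",
        api_key="MY_API_KEY",  # Name of the environment variable holding the key
    )
)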

Configuration Parameters#

Quick Reference - Essential Parameters:

# Minimal configuration for academic benchmark evaluation
params = ConfigParams(
    temperature=0.01,  # Near-deterministic (0.0 not supported by all endpoints)
    top_p=1.0,  # No nucleus sampling
    max_new_tokens=256,  # Sufficient for most academic tasks
    limit_samples=100,  # Remove for full dataset
    parallelism=4,  # Adjust based on endpoint capacity
)

See also

Complete Parameter Reference

This guide shows minimal configuration for getting started. For comprehensive parameter options including:

  • Framework-specific parameters (num_fewshot, tokenizer, etc.)

  • Optimization patterns for different scenarios

  • Troubleshooting common configuration issues

  • Performance tuning guidelines

See Evaluation Configuration Parameters.

Key Parameters for Text Generation (contrasting configurations are sketched after this list):

  • temperature: Use 0.01 for near-deterministic, reproducible results

  • max_new_tokens: Controls maximum response length

  • limit_samples: Limits evaluation to a subset for testing

  • parallelism: Balances speed with server capacity
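
To see how these parameters combine in practice, here is a sketch contrasting a quick smoke-test configuration with a full-run configuration; the specific values are illustrative starting points rather than tuned recommendations.

from nemo_evaluator.api.api_dataclasses import ConfigParams

# Smoke test: small subset and low concurrency, useful for verifying the setup
smoke_params = ConfigParams(
    temperature=0.01,
    max_new_tokens=256,
    limit_samples=20,
    parallelism=2,
)

# Full run: entire dataset and higher concurrency (scale to your endpoint's capacity)
full_run_params = ConfigParams(
    temperature=0.01,
    max_new_tokens=512,
    limit_samples=None,  # Evaluate the full dataset
    parallelism=8,
)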

Understanding Results#

After evaluation completes, you’ll receive structured results with task-level metrics:

# Access evaluation results
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

# Access task-level metrics
task_result = result.tasks["mmlu_pro"]
accuracy = task_result.metrics["acc"].scores["acc"].value
print(f"MMLU Pro Accuracy: {accuracy:.2%}")

# Access metrics with statistics
acc_metric = task_result.metrics["acc"]
acc = acc_metric.scores["acc"].value
stderr = acc_metric.scores["acc"].stats.stderr
print(f"Accuracy: {acc:.3f} ± {stderr:.3f}")

Common Metrics#

  • acc (Accuracy): Percentage of correct responses

  • acc_norm (Normalized Accuracy): Length-normalized scoring (often more reliable)

  • exact_match: Exact string match percentage

  • f1: F1 score for token-level overlap

Each metric includes statistics (mean, stderr) from which you can derive confidence intervals, as shown in the sketch below.
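
Continuing from the result object in the previous example, this sketch walks every task, metric, and score and reports an approximate 95% confidence interval. It assumes the result.tasks[...].metrics[...].scores[...] layout shown above; adjust the attribute names if your version differs.

# Iterate over all tasks, metrics, and scores in an evaluation result
for task_name, task_result in result.tasks.items():
    for metric_name, metric in task_result.metrics.items():
        for score_name, score in metric.scores.items():
            stderr = getattr(score.stats, "stderr", None) if score.stats else None
            if stderr is not None:
                # Approximate 95% confidence interval: value +/- 1.96 * stderr
                low = score.value - 1.96 * stderr
                high = score.value + 1.96 * stderr
                print(f"{task_name}/{metric_name}/{score_name}: "
                      f"{score.value:.3f} (95% CI {low:.3f}-{high:.3f})")
            else:
                print(f"{task_name}/{metric_name}/{score_name}: {score.value:.3f}")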

Multi-Task Evaluation#

Evaluate across multiple academic benchmarks in a single workflow:

from nemo_evaluator.core.evaluate import evaluate

# Configure target endpoint (reused for all tasks)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,
        api_key="YOUR_API_KEY",
    )
)

# Define academic benchmark suite
academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"]
results = {}

# Run evaluations
for task in academic_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(
            limit_samples=50,  # Quick testing
            temperature=0.01,  # Near-deterministic
            parallelism=4,
        ),
    )

    results[task] = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(f"✓ Completed {task}")

# Summary report
print("\nAcademic Benchmark Results:")
for task_name, result in results.items():
    if task_name in result.tasks:
        task_result = result.tasks[task_name]
        if "acc" in task_result.metrics:
            acc = task_result.metrics["acc"].scores["acc"].value
            print(f"{task_name:20s}: {acc:.2%}")

Tip

Run this example: python docs/evaluation/_snippets/api-examples/multi_task.py

Common Issues#

“Temperature cannot be 0.0” Error

Some endpoints don’t support exact 0.0 temperature. Use 0.01 instead:

params = ConfigParams(temperature=0.01)  # Near-deterministic

Slow Evaluation Performance

Symptoms: Evaluation takes too long or times out

Solutions:

  • Increase parallelism (start with 4, scale to 8-16 based on endpoint capacity)

  • Reduce request_timeout if requests hang

  • Use limit_samples for initial testing before full runs

  • Check endpoint health and availability

# Optimized configuration
params = ConfigParams(
    parallelism=8,           # Higher concurrency
    request_timeout=120,     # Per-request timeout for slow endpoints
    limit_samples=100,       # Test subset first
    max_retries=3           # Retry failed requests
)

API Authentication Errors

Symptoms: 401 or 403 errors during evaluation

Solutions:

  • Verify api_key parameter contains the environment variable NAME, not the key value

  • Ensure the environment variable is set: export YOUR_API_KEY="actual_key_value"

  • Check API key has necessary permissions

# Correct setup (shell): export the key value under a known variable name
export MY_API_KEY="nvapi-..."

# Python: pass the environment variable NAME, not the key value
api_endpoint = ApiEndpoint(
    api_key="MY_API_KEY"  # Name of env var, not the value
)

Task Not Found Error

Symptoms: Task name not recognized

Solutions:

  • Verify task name with nemo-evaluator-launcher ls tasks

  • Check if evaluation framework is installed

  • Use framework-qualified names for ambiguous tasks (e.g., lm-evaluation-harness.mmlu)

# Discover available tasks
nemo-evaluator-launcher ls tasks | grep mmlu

Next Steps#