Text Generation Evaluation#

Text generation evaluation is the primary method for assessing LLM capabilities: models produce natural language responses to prompts, and the generated text is assessed for quality, accuracy, and appropriateness across a wide range of tasks and domains.

Before You Start#

Ensure you have:

  1. Model Endpoint: An OpenAI-compatible API endpoint for your model (completions or chat)

  2. API Access: Valid API key if your endpoint requires authentication

  3. Installed Packages: NeMo Evaluator or access to evaluation containers

  4. Sufficient Resources: Adequate compute for your chosen benchmarks

Pre-Flight Check#

Verify your setup before running full evaluation:

import os

import requests


def check_endpoint(endpoint_url: str, model_id: str, api_key: str) -> bool:
    """Send a minimal chat request to verify the endpoint is reachable."""
    try:
        response = requests.post(
            endpoint_url,
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model_id,
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 10,
            },
            timeout=10,
        )
        assert response.status_code == 200, (
            f"Endpoint returned status {response.status_code}"
        )
        print("✓ Endpoint ready for evaluation")
        return True
    except Exception as e:
        print(f"✗ Endpoint check failed: {e}")
        print("Ensure your API key is valid and the endpoint is accessible")
        return False


if __name__ == "__main__":
    check_endpoint(
        endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        api_key=os.environ["MY_API_KEY"],
    )

Tip

Run this script directly: python docs/evaluation/_snippets/prerequisites/endpoint_check.py


Evaluation Approach#

In text generation evaluation, four steps repeat for every sample (the toy sketch at the end of this section makes them concrete):

  1. Prompt Construction: Models receive carefully crafted prompts (questions, instructions, or text to continue)

  2. Response Generation: Models generate natural language responses using their trained parameters

  3. Response Assessment: Generated text is evaluated for correctness, quality, or adherence to specific criteria

  4. Metric Calculation: Numerical scores are computed based on evaluation criteria

This differs from log-probability evaluation where models assign confidence scores to predefined choices. For log-probability methods, see the Log-Probability Evaluation guide.
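
To make the four steps concrete, here is a deliberately simplified, self-contained sketch. The scoring is plain substring matching rather than any benchmark's real metric, and the endpoint URL, model ID, and MY_API_KEY environment variable are the same placeholders used throughout this guide; in practice these steps are handled for you by the evaluation frameworks.

import os

import requests

endpoint_url = "https://integrate.api.nvidia.com/v1/chat/completions"
model_id = "meta/llama-3.1-8b-instruct"
api_key = os.environ["MY_API_KEY"]

# Step 1: prompt construction (a tiny hand-written sample set, not a real benchmark)
samples = [
    {"prompt": "Answer with a single number: 2 + 2 =", "reference": "4"},
    {"prompt": "Answer with one word: the capital of France is", "reference": "Paris"},
]

correct = 0
for sample in samples:
    # Step 2: response generation via an OpenAI-compatible chat endpoint
    response = requests.post(
        endpoint_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model_id,
            "messages": [{"role": "user", "content": sample["prompt"]}],
            "max_tokens": 32,
            "temperature": 0.01,
        },
        timeout=30,
    )
    response.raise_for_status()
    generated = response.json()["choices"][0]["message"]["content"]

    # Step 3: response assessment (here: naive substring match against the reference)
    correct += int(sample["reference"].lower() in generated.lower())

# Step 4: metric calculation (accuracy over the sample set)
print(f"Toy accuracy: {correct / len(samples):.2%}")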

Choose Your Approach#

Recommended: The fastest way to run text generation evaluations is with the unified launcher CLI:

# List available text generation tasks
nemo-evaluator-launcher ls tasks

# Run MMLU Pro evaluation
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mmlu_pro"]' \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.api_key=${YOUR_API_KEY}

# Run multiple text generation benchmarks
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_text_generation_suite \
    -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]'

For programmatic evaluation in custom workflows:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure text generation evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=None,  # Full dataset
        temperature=0.01,    # Near-deterministic for reproducibility
        max_new_tokens=512,
        top_p=0.95
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct", 
        type=EndpointType.CHAT,
        api_key="MY_API_KEY"  # Environment variable name containing your API key
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")

For specialized container workflows:

# Pull and run text generation container
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash

# Inside container - set environment
export MY_API_KEY=your_api_key_here

# Run evaluation
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /tmp/results \
    --overrides 'config.params.limit_samples=100'

Discovering Available Tasks#

Use the launcher CLI to discover all available text generation tasks:

# List all available benchmarks
nemo-evaluator-launcher ls tasks

# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls tasks --json

# Filter for specific task types (example: academic benchmarks)
nemo-evaluator-launcher ls tasks | grep -E "(mmlu|gsm8k|arc)"

Run these commands to discover the complete list of available benchmarks across all installed frameworks.

Text Generation Task Categories#

Text Generation Benchmarks Overview#

| Area | Purpose | Example Tasks | Evaluation Method |
|---|---|---|---|
| Academic Benchmarks | Assess general knowledge and reasoning across academic domains | mmlu, mmlu_pro, arc_challenge, hellaswag, truthfulqa | Multiple-choice or short-answer text generation |
| Instruction Following | Evaluate ability to follow complex instructions and formatting requirements | ifeval, gpqa_diamond | Generated responses assessed against instruction criteria |
| Mathematical Reasoning | Test mathematical problem-solving and multi-step reasoning | gsm8k, math | Final answer extraction and numerical comparison |
| Multilingual Evaluation | Assess capabilities across different languages | mgsm (multilingual GSM8K) | Language-specific text generation and assessment |

Note

Task availability depends on installed frameworks. Use nemo-evaluator-launcher ls tasks to see the complete list for your environment.

Task Naming and Framework Specification#

Use simple task names when only one framework provides the task:

# Unambiguous task names
config = EvaluationConfig(type="mmlu")
config = EvaluationConfig(type="gsm8k")
config = EvaluationConfig(type="arc_challenge")

These tasks have unique names across all evaluation frameworks, so no qualification is needed.

When multiple frameworks provide the same task, specify the framework explicitly:

# Explicit framework specification
config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
config = EvaluationConfig(type="simple-evals.mmlu")

Use this approach when:

  • Multiple frameworks implement the same benchmark

  • You need specific framework behavior or scoring

  • Avoiding ambiguity in task resolution

Resolve task naming conflicts by listing available tasks:

from nemo_evaluator import show_available_tasks

# Display all tasks organized by framework
print("Available tasks by framework:")
show_available_tasks()

Or use the launcher CLI from the command line:

# List all tasks with framework information
nemo-evaluator-launcher ls tasks

# Filter for specific tasks
nemo-evaluator-launcher ls tasks | grep mmlu

This helps you:

  • Identify which framework implements a task

  • Resolve naming conflicts programmatically

  • Understand available task sources

Evaluation Configuration#

Basic Configuration Structure#

Text generation evaluations use the NVIDIA Eval Commons framework:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)

# Configure target endpoint
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="megatron_model"
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation parameters
params = ConfigParams(
    temperature=0.01,   # Near-deterministic generation
    top_p=1.0,          # No nucleus sampling
    limit_samples=100,  # Evaluate subset for testing
    parallelism=1       # Single-threaded requests
)

# Configure evaluation task
config = EvaluationConfig(
    type="mmlu",
    params=params,
    output_dir="./evaluation_results"
)

# Execute evaluation
results = evaluate(target_cfg=target, eval_cfg=config)

Endpoint Types#

Completions Endpoint (/v1/completions/):

  • Direct text completion without conversation formatting

  • Used for: Academic benchmarks, reasoning tasks, base model evaluation

  • Model processes prompts as-is without applying chat templates

Chat Endpoint (/v1/chat/completions/):

  • Conversational interface with role-based message formatting

  • Used for: Instruction following, chat benchmarks, instruction-tuned models

  • Requires models with defined chat templates (see the configuration sketch below)
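
Both endpoint types are configured through the same ApiEndpoint dataclass; the sketch below shows the two variants side by side, reusing the endpoint URLs and model IDs that appear elsewhere in this guide.

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

# Completions endpoint: prompts are sent as-is, no chat template (typical for base models)
completions_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        type=EndpointType.COMPLETIONS,
        model_id="megatron_model",
    )
)

# Chat endpoint: role-based messages, requires a chat template (instruction-tuned models)
chat_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        model_id="meta/llama-3.1-8b-instruct",
        api_key="MY_API_KEY",  # Name of the environment variable holding the key
    )
)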

Configuration Parameters#

Quick Reference - Essential Parameters:

# Minimal configuration for academic benchmark evaluation
params = ConfigParams(
    temperature=0.01,  # Near-deterministic (0.0 not supported by all endpoints)
    top_p=1.0,  # No nucleus sampling
    max_new_tokens=256,  # Sufficient for most academic tasks
    limit_samples=100,  # Remove for full dataset
    parallelism=4,  # Adjust based on endpoint capacity
)

See also

Complete Parameter Reference

This guide shows minimal configuration for getting started. For comprehensive parameter options including:

  • Framework-specific parameters (num_fewshot, tokenizer, etc.)

  • Optimization patterns for different scenarios

  • Troubleshooting common configuration issues

  • Performance tuning guidelines

See Evaluation Configuration Parameters.

Key Parameters for Text Generation (contrasting configurations are sketched after this list):

  • temperature: Use 0.01 for near-deterministic, reproducible results

  • max_new_tokens: Controls maximum response length

  • limit_samples: Limits evaluation to a subset for testing

  • parallelism: Balances speed with server capacity
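
To see how these parameters combine in practice, here is a sketch contrasting a quick smoke-test configuration with a full-run configuration; the specific values are illustrative starting points rather than tuned recommendations.

from nemo_evaluator.api.api_dataclasses import ConfigParams

# Smoke test: small subset and low concurrency, useful for verifying the setup
smoke_params = ConfigParams(
    temperature=0.01,
    max_new_tokens=256,
    limit_samples=20,
    parallelism=2,
)

# Full run: entire dataset and higher concurrency (scale to your endpoint's capacity)
full_run_params = ConfigParams(
    temperature=0.01,
    max_new_tokens=512,
    limit_samples=None,  # Evaluate the full dataset
    parallelism=8,
)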

Understanding Results#

After evaluation completes, you’ll receive structured results with task-level metrics:

# Access evaluation results
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

# Access task-level metrics
task_result = result.tasks["mmlu_pro"]
accuracy = task_result.metrics["acc"].scores["acc"].value
print(f"MMLU Pro Accuracy: {accuracy:.2%}")

# Access metrics with statistics
acc_metric = task_result.metrics["acc"]
acc = acc_metric.scores["acc"].value
stderr = acc_metric.scores["acc"].stats.stderr
print(f"Accuracy: {acc:.3f} ± {stderr:.3f}")

Common Metrics#

  • acc (Accuracy): Percentage of correct responses

  • acc_norm (Normalized Accuracy): Length-normalized scoring (often more reliable)

  • exact_match: Exact string match percentage

  • f1: F1 score for token-level overlap

Each metric includes statistics (mean, stderr) from which you can derive confidence intervals, as shown in the sketch below.
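
Continuing from the result object in the previous example, this sketch walks every task, metric, and score and reports an approximate 95% confidence interval. It assumes the result.tasks[...].metrics[...].scores[...] layout shown above; adjust the attribute names if your version differs.

# Iterate over all tasks, metrics, and scores in an evaluation result
for task_name, task_result in result.tasks.items():
    for metric_name, metric in task_result.metrics.items():
        for score_name, score in metric.scores.items():
            stderr = getattr(score.stats, "stderr", None) if score.stats else None
            if stderr is not None:
                # Approximate 95% confidence interval: value +/- 1.96 * stderr
                low = score.value - 1.96 * stderr
                high = score.value + 1.96 * stderr
                print(f"{task_name}/{metric_name}/{score_name}: "
                      f"{score.value:.3f} (95% CI {low:.3f}-{high:.3f})")
            else:
                print(f"{task_name}/{metric_name}/{score_name}: {score.value:.3f}")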

Multi-Task Evaluation#

Evaluate across multiple academic benchmarks in a single workflow:

from nemo_evaluator.core.evaluate import evaluate

# Configure target endpoint (reused for all tasks)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,
        api_key="YOUR_API_KEY",
    )
)

# Define academic benchmark suite
academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"]
results = {}

# Run evaluations
for task in academic_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(
            limit_samples=50,  # Quick testing
            temperature=0.01,  # Near-deterministic
            parallelism=4,
        ),
    )

    results[task] = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(f"✓ Completed {task}")

# Summary report
print("\nAcademic Benchmark Results:")
for task_name, result in results.items():
    if task_name in result.tasks:
        task_result = result.tasks[task_name]
        if "acc" in task_result.metrics:
            acc = task_result.metrics["acc"].scores["acc"].value
            print(f"{task_name:20s}: {acc:.2%}")

Tip

Run this example: python docs/evaluation/_snippets/api-examples/multi_task.py

Common Issues#

“Temperature cannot be 0.0” Error

Some endpoints don’t support exact 0.0 temperature. Use 0.01 instead:

params = ConfigParams(temperature=0.01)  # Near-deterministic

Slow Evaluation Performance

Symptoms: Evaluation takes too long or times out

Solutions:

  • Increase parallelism (start with 4, scale to 8-16 based on endpoint capacity)

  • Reduce request_timeout if requests hang

  • Use limit_samples for initial testing before full runs

  • Check endpoint health and availability

# Optimized configuration
params = ConfigParams(
    parallelism=8,           # Higher concurrency
    request_timeout=120,     # Per-request timeout for slow endpoints
    limit_samples=100,       # Test subset first
    max_retries=3           # Retry failed requests
)

API Authentication Errors

Symptoms: 401 or 403 errors during evaluation

Solutions:

  • Verify api_key parameter contains the environment variable NAME, not the key value

  • Ensure the environment variable is set: export YOUR_API_KEY="actual_key_value"

  • Check API key has necessary permissions

# Correct setup (shell): export the key value under a known variable name
export MY_API_KEY="nvapi-..."

# Python: pass the environment variable NAME, not the key value
api_endpoint = ApiEndpoint(
    api_key="MY_API_KEY"  # Name of env var, not the value
)

Task Not Found Error

Symptoms: Task name not recognized

Solutions:

  • Verify task name with nemo-evaluator-launcher ls tasks

  • Check if evaluation framework is installed

  • Use framework-qualified names for ambiguous tasks (e.g., lm-evaluation-harness.mmlu)

# Discover available tasks
nemo-evaluator-launcher ls tasks | grep mmlu

Next Steps#