Evaluate LLMs Using Log-Probabilities#

Introduction#

While the most typical approach to LLM evaluation involves assessing the quality of a model’s generated response to a question, an alternative method uses log-probabilities.

In this approach, we quantify a model’s “surprise” or uncertainty when processing a text sequence. This is done by summing the log-probabilities that the model assigns to each token in the sequence. A higher sum indicates that the model is more confident about the sequence.

In this evaluation approach:

  • The LLM is given a single combined text containing both the question and a potential answer.

  • Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer.

  • This allows an assessment of how likely it is that the model would provide that answer for the given question.

For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model.
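
As a minimal sketch of this selection step (not the harness’s actual implementation), assume that answer_logprobs already holds the per-token log-probabilities returned for the answer portion of each candidate:

# Hypothetical per-token log-probabilities for the answer tokens of each
# candidate continuation (values are made up for illustration).
answer_logprobs = {
    "use a water bottle": [-1.2, -0.4, -0.9, -0.3],
    "use a fork": [-1.2, -0.4, -2.7, -1.9],
}

# Score each candidate by summing the log-probabilities of its answer tokens.
scores = {answer: sum(lps) for answer, lps in answer_logprobs.items()}

# The candidate with the highest sum is treated as the model's choice.
predicted = max(scores, key=scores.get)
print(predicted, scores[predicted])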

The sum of log-probabilities can be used to calculate different metrics, such as perplexity. Additionally, log-probabilities can be analyzed to assess whether a response would be generated by the model using greedy sampling—a method commonly employed to evaluate accuracy.
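
For example, continuing the sketch above with illustrative numbers, both perplexity and a greedy-decoding check can be derived from the per-token log-probabilities of the answer:

import math

# Per-token log-probabilities of the answer tokens (illustrative values).
token_logprobs = [-1.2, -0.4, -0.9, -0.3]

# Perplexity is the exponential of the average negative log-probability.
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))

# The answer would be produced by greedy sampling only if every answer token
# was the most likely token at its position; the flags here are illustrative
# and would come from comparing each answer token to the model's top-1 token.
is_top_token = [True, True, False, True]
greedy_match = all(is_top_token)

print(f"perplexity={perplexity:.2f}, greedy_match={greedy_match}")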

Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format.

Tip

In the example below we use the piqa benchmark, but the instructions provided apply to all lm-evaluation-harness tasks utilizing log-probabilities, such as:

  • arc_challenge

  • arc_multilingual

  • bbh

  • commonsense_qa

  • hellaswag

  • hellaswag_multilingual

  • musr

  • openbookqa

  • social_iqa

  • truthfulqa

  • winogrande

Before You Start#

Ensure you have:

  • Completions Endpoint: Log-probability tasks require completions endpoints (not chat) that support the logprobs and echo parameters (see Log-probabilities testing)

  • Model Tokenizer: Access to tokenizer files for client-side tokenization (supported types: huggingface or tiktoken); a quick loading check is sketched after this list

  • API Access: Valid API key for your model endpoint if it is gated

  • Authentication: Hugging Face token for gated datasets and tokenizers
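
Before launching a run, it can be useful to verify that the tokenizer loads on the client side. A minimal check, assuming a huggingface tokenizer backend and an HF_TOKEN exported in your environment for gated repositories, might look like this:

from transformers import AutoTokenizer

# Replace the identifier with a local checkpoint path if the tokenizer is stored on disk.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("Hello, world!"))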

Use NeMo Evaluator Launcher#

Use an example config for deploying and evaluating the Meta Llama 3.1 8B model:

defaults:
  - execution: local
  - deployment: vllm
  - _self_

execution:
  output_dir: llama_local
  env_vars:
    deployment:
      HF_TOKEN: ${oc.env:HF_TOKEN}  # needed to access meta-llama/Llama-3.1-8B gated model

# specify deployment arguments
deployment:
  checkpoint_path: null
  hf_model_handle: meta-llama/Llama-3.1-8B
  served_model_name: meta-llama/Llama-3.1-8B
  tensor_parallel_size: 1
  data_parallel_size: 1
  extra_args: "--max-model-len 32768"

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config:  # global config settings that apply to all tasks
    config:
      params:
        extra:  # for log-probability tasks like piqa, you need to specify the tokenizer
          tokenizer: meta-llama/Llama-3.1-8B  # or use a path to locally stored checkpoint
          tokenizer_backend: huggingface      # or "tiktoken"
  env_vars:
      HF_TOKEN: HF_TOKEN  # needed to access the tokenizer on the client side
  tasks:
    - name: piqa

To launch the evaluation, run:

nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_vllm_llama_3_1_8b_logprobs.yaml

Tip

Set deployment: none and provide a target specification if you want to evaluate an existing endpoint instead:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_local
  env_vars:
    HF_TOKEN: ${oc.env:HF_TOKEN}  # needed to access meta-llama/Llama-3.1-8B gated model
 
target:
  api_endpoint:
    model_id: meta-llama/Llama-3.1-8B
    url: https://your-endpoint.com/v1/completions
    api_key_name: API_KEY # API Key with access to provided url

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config:  # global config settings that apply to all tasks
    config:
      params:
        extra:  # for log-probability tasks like piqa, you need to specify the tokenizer
          tokenizer: meta-llama/Llama-3.1-8B  # or use a path to locally stored checkpoint
          tokenizer_backend: huggingface      # or "tiktoken"
  tasks:
    - name: piqa

Use NeMo Evaluator#

Start the lm-evaluation-harness Docker container:

docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10

or install the nemo-evaluator and nvidia-lm-eval Python packages in your environment of choice:

pip install nemo-evaluator nvidia-lm-eval

To launch the evaluation, run the following Python code:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

# Configure the evaluation target
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8000/v1/completions/",
    type="completions",
    model_id="meta-llama/Llama-3.1-8B",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(
    extra={
        "tokenizer": "meta-llama/Llama-3.1-8B",  # or path to locally stored checkpoint with tokenizer
        "tokenizer_backend": "huggingface",  # or "tiktoken"
    },
)
eval_config = EvaluationConfig(type="piqa", params=eval_params, output_dir="results")

evaluate(target_cfg=eval_target, eval_cfg=eval_config)

Make sure to provide the source for the tokenizer and a backend for loading it.

For models trained with the NeMo Framework, the tokenizer is stored inside the checkpoint directory. For the NeMo format, it is available in the context/nemo_tokenizer subdirectory:

    extra={
        "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
        "tokenizer_backend": "huggingface",
    },

For Megatron Bridge checkpoints, the tokenizer is stored under the tokenizer subdirectory:

    extra={
        "tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
        "tokenizer_backend": "huggingface",
    },

How it works#

When the server receives a logprobs=<int> parameter in the request, it returns the log-probabilities of the tokens. When this is combined with echo=true, the response also includes the input tokens along with their corresponding log-probabilities.
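
As an illustration, a raw completions request with these parameters against an OpenAI-compatible endpoint might look like the sketch below; the harness builds such requests for you, and max_tokens=0 is an assumption that the server scores the prompt without generating new tokens:

import requests

# Sketch of a completions request asking for the echoed input with log-probabilities.
response = requests.post(
    "http://0.0.0.0:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "Question: ... Answer: ...",  # question and candidate answer combined
        "max_tokens": 0,   # assumption: no new tokens are generated, only the prompt is scored
        "logprobs": 1,     # return per-token log-probabilities
        "echo": True,      # include the input tokens and their log-probabilities in the response
    },
)
logprobs = response.json()["choices"][0]["logprobs"]
print(logprobs["tokens"][:5], logprobs["token_logprobs"][:5])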

The received response is then processed on the client (benchmark) side to isolate the log-probabilities that correspond specifically to the answer portion of the input. For this purpose, the input is tokenized, which makes it possible to trace which log-probabilities originated from the question and which from the answer.
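
A simplified sketch of this client-side step, assuming a Hugging Face tokenizer and an echoed response like the one above, could look like the following (the actual harness handles tokenization edge cases more carefully):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

question = "Question: How do you separate egg whites from the yolk? Answer:"
answer = " Use a water bottle."

# Tokenize the question alone and the full input to locate the answer tokens.
question_ids = tokenizer.encode(question)
full_ids = tokenizer.encode(question + answer)
num_answer_tokens = len(full_ids) - len(question_ids)

# token_logprobs is the echoed per-token list returned by the server
# (illustrative values here); the last num_answer_tokens entries belong to the answer.
token_logprobs = [-3.1, -0.2, -1.5, -0.8, -0.4, -0.9, -0.3]
answer_score = sum(token_logprobs[-num_answer_tokens:])
print(answer_score)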