Evaluate LLMs Using Log-Probabilities#

Introduction#

While the most typical approach to LLM evaluation involves assessing the quality of a model’s generated response to a question, an alternative method uses log-probabilities.

In this approach, we quantify a model’s “surprise” or uncertainty when processing a text sequence. This is done by summing the log-probabilities that the model assigns to each token in the sequence. A higher sum indicates that the model is more confident about the sequence.

In this evaluation approach:

  • The LLM is given a single combined text containing both the question and a potential answer.

  • Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer.

  • This allows an assessment of how likely it is that the model would provide that answer for the given question.

For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model.
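
As a minimal sketch of this selection step (not the harness’s actual implementation), assume that answer_logprobs already holds the per-token log-probabilities returned for the answer portion of each candidate:

# Hypothetical per-token log-probabilities for the answer tokens of each
# candidate continuation (values are made up for illustration).
answer_logprobs = {
    "use a water bottle": [-1.2, -0.4, -0.9, -0.3],
    "use a fork": [-1.2, -0.4, -2.7, -1.9],
}

# Score each candidate by summing the log-probabilities of its answer tokens.
scores = {answer: sum(lps) for answer, lps in answer_logprobs.items()}

# The candidate with the highest sum is treated as the model's choice.
predicted = max(scores, key=scores.get)
print(predicted, scores[predicted])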

The sum of log-probabilities can be used to calculate different metrics, such as perplexity. Additionally, log-probabilities can be analyzed to assess whether a response would be generated by the model using greedy sampling—a method commonly employed to evaluate accuracy.
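
For example, continuing the sketch above with illustrative numbers, both perplexity and a greedy-decoding check can be derived from the per-token log-probabilities of the answer:

import math

# Per-token log-probabilities of the answer tokens (illustrative values).
token_logprobs = [-1.2, -0.4, -0.9, -0.3]

# Perplexity is the exponential of the average negative log-probability.
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))

# The answer would be produced by greedy sampling only if every answer token
# was the most likely token at its position; the flags here are illustrative
# and would come from comparing each answer token to the model's top-1 token.
is_top_token = [True, True, False, True]
greedy_match = all(is_top_token)

print(f"perplexity={perplexity:.2f}, greedy_match={greedy_match}")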

Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format.

Tip

In the example below we use the piqa benchmark, but the instructions provided apply to all lm-evaluation-harness tasks utilizing log-probabilities, such as:

  • arc_challenge

  • arc_multilingual

  • bbh

  • commonsense_qa

  • hellaswag

  • hellaswag_multilingual

  • musr

  • openbookqa

  • social_iqa

  • truthfulqa

  • winogrande

Before You Start#

Ensure you have:

  • Completions Endpoint: Log-probability tasks require completions endpoints (not chat) that support the logprobs and echo parameters (see Log-probabilities testing)

  • Model Tokenizer: Access to tokenizer files for client-side tokenization (supported types: huggingface or tiktoken); a quick loading check is sketched after this list

  • API Access: Valid API key for your model endpoint if it is gated

  • Authentication: Hugging Face token for gated datasets and tokenizers
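
Before launching a run, it can be useful to verify that the tokenizer loads on the client side. A minimal check, assuming a huggingface tokenizer backend and an HF_TOKEN exported in your environment for gated repositories, might look like this:

from transformers import AutoTokenizer

# Replace the identifier with a local checkpoint path if the tokenizer is stored on disk.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("Hello, world!"))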

Use NeMo Evaluator Launcher#

Use an example config for deploying and evaluating the Meta Llama 3.1 8B model:

defaults:
  - execution: local
  - deployment: vllm
  - _self_

execution:
  output_dir: llama_local
  env_vars:
    deployment:
      HF_TOKEN: ${oc.env:HF_TOKEN}  # needed to access meta-llama/Llama-3.1-8B gated model

# specify deployment arguments
deployment:
  checkpoint_path: null
  hf_model_handle: meta-llama/Llama-3.1-8B
  served_model_name: meta-llama/Llama-3.1-8B
  tensor_parallel_size: 1
  data_parallel_size: 1
  extra_args: "--max-model-len 32768"

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config:  # global config settings that apply to all tasks
    config:
      params:
        extra:  # for log-probability tasks like piqa, you need to specify the tokenizer
          tokenizer: meta-llama/Llama-3.1-8B  # or use a path to locally stored checkpoint
          tokenizer_backend: huggingface      # or "tiktoken"
  env_vars:
      HF_TOKEN: HF_TOKEN  # needed to access the tokenizer on the client side
  tasks:
    - name: piqa

To launch the evaluation, run:

nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_vllm_llama_3_1_8b_logprobs.yaml

Tip

Set deployment: none and provide a target specification if you want to evaluate an existing endpoint instead:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_local
  env_vars:
    HF_TOKEN: ${oc.env:HF_TOKEN}  # needed to access meta-llama/Llama-3.1-8B gated model
 
target:
  api_endpoint:
    model_id: meta-llama/Llama-3.1-8B
    url: https://your-endpoint.com/v1/completions
    api_key_name: API_KEY # API Key with access to provided url

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config:  # global config settings that apply to all tasks
    config:
      params:
        extra:  # for log-probability tasks like piqa, you need to specify the tokenizer
          tokenizer: meta-llama/Llama-3.1-8B  # or use a path to locally stored checkpoint
          tokenizer_backend: huggingface      # or "tiktoken"
  tasks:
    - name: piqa

Use NeMo Evaluator#

Start the lm-evaluation-harness Docker container:

docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10

or install the nemo-evaluator and nvidia-lm-eval Python packages in your environment of choice:

pip install nemo-evaluator nvidia-lm-eval

To launch the evaluation, run the following Python code:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

# Configure the evaluation target
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8000/v1/completions/",
    type="completions",
    model_id="meta-llama/Llama-3.1-8B",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(
    extra={
        "tokenizer": "meta-llama/Llama-3.1-8B",  # or path to locally stored checkpoint with tokenizer
        "tokenizer_backend": "huggingface",  # or "tiktoken"
    },
)
eval_config = EvaluationConfig(type="piqa", params=eval_params, output_dir="results")

evaluate(target_cfg=eval_target, eval_cfg=eval_config)

Make sure to provide the source for the tokenizer and a backend for loading it.

For models trained with the NeMo Framework, the tokenizer is stored inside the checkpoint directory. For the NeMo format, it is available in the context/nemo_tokenizer subdirectory:

    extra={
        "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
        "tokenizer_backend": "huggingface",
    },

For Megatron Bridge checkpoints, the tokenizer is stored under the tokenizer subdirectory:

    extra={
        "tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
        "tokenizer_backend": "huggingface",
    },

How it works#

When the server receives a logprobs=<int> parameter in the request, it returns the log-probabilities of the tokens. When this is combined with echo=true, the response also includes the input tokens along with their corresponding log-probabilities.
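
As an illustration, a raw completions request with these parameters against an OpenAI-compatible endpoint might look like the sketch below; the harness builds such requests for you, and max_tokens=0 is an assumption that the server scores the prompt without generating new tokens:

import requests

# Sketch of a completions request asking for the echoed input with log-probabilities.
response = requests.post(
    "http://0.0.0.0:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "Question: ... Answer: ...",  # question and candidate answer combined
        "max_tokens": 0,   # assumption: no new tokens are generated, only the prompt is scored
        "logprobs": 1,     # return per-token log-probabilities
        "echo": True,      # include the input tokens and their log-probabilities in the response
    },
)
logprobs = response.json()["choices"][0]["logprobs"]
print(logprobs["tokens"][:5], logprobs["token_logprobs"][:5])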

The received response is then processed on the client (benchmark) side to isolate the log-probabilities that correspond specifically to the answer portion of the input. For this purpose, the input is tokenized, which makes it possible to trace which log-probabilities originated from the question and which from the answer.
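
A simplified sketch of this client-side step, assuming a Hugging Face tokenizer and an echoed response like the one above, could look like the following (the actual harness handles tokenization edge cases more carefully):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

question = "Question: How do you separate egg whites from the yolk? Answer:"
answer = " Use a water bottle."

# Tokenize the question alone and the full input to locate the answer tokens.
question_ids = tokenizer.encode(question)
full_ids = tokenizer.encode(question + answer)
num_answer_tokens = len(full_ids) - len(question_ids)

# token_logprobs is the echoed per-token list returned by the server
# (illustrative values here); the last num_answer_tokens entries belong to the answer.
token_logprobs = [-3.1, -0.2, -1.5, -0.8, -0.4, -0.9, -0.3]
answer_score = sum(token_logprobs[-num_answer_tokens:])
print(answer_score)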