Evaluate LLMs Using Log-Probabilities#

This guide demonstrates how to evaluate a Large Language Model using log-probabilities. It provides a complete, practical example of running this evaluation on a NeMo checkpoint with the arc_challenge task.

The instructions provided in this example apply to all nvidia-lm-eval tasks that use log-probabilities:

  • arc_challenge

  • arc_multilingual

  • bbh

  • commonsense_qa

  • hellaswag

  • hellaswag_multilingual

  • musr

  • openbookqa

  • piqa

  • social_iqa

  • truthfulqa

  • winogrande

Note that all benchmarks that use log-probabilities require a “completions” endpoint. Refer to “Evaluate Checkpoints Trained by NeMo Framework” for more information on the different endpoint types.

Introduction#

While the most typical approach to LLM evaluation involves assessing the quality of a model’s generated response to a question, an alternative method uses log-probabilities.

In this approach, we quantify a model’s “surprise”, or uncertainty, when processing a text sequence. This is done by summing the log-probabilities that the model assigns to each token. A higher sum indicates that the model is more confident about the sequence.

In this evaluation approach:

  • The LLM is given a single combined text containing both the question and a potential answer.

  • Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer.

  • This allows an assessment of how likely it is that the model would provide that answer for the given question.

For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model.
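
As a minimal sketch of this selection logic, consider the snippet below. The answer_logprob_sum callable is hypothetical and stands in for the scoring that the evaluation harness performs against the model:

def pick_answer(question, choices, answer_logprob_sum):
    # Score each candidate answer by the sum of the log-probabilities
    # the model assigns to its tokens, given the question.
    scores = [answer_logprob_sum(question, choice) for choice in choices]
    # The highest (least negative) sum is treated as the model's selection.
    best_index = max(range(len(choices)), key=lambda i: scores[i])
    return choices[best_index]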

The sum of log-probabilities can also be used to calculate other metrics, such as perplexity. Alternatively, log-probabilities can be analyzed to determine whether the answer would be produced by the model with greedy sampling. This approach is used to calculate accuracy.
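
For example, the perplexity of an answer span can be derived directly from its log-probabilities. The following is a minimal sketch; the token-level values are illustrative:

import math

# Per-token log-probabilities for an answer span (illustrative values).
token_logprobs = [-0.5, -1.2, -0.3]

# Perplexity is the exponential of the negative mean log-probability;
# lower perplexity means the model finds the sequence less "surprising".
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(perplexity)  # ~1.95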

Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format.

Evaluate a NeMo Checkpoint with arc_challenge#

In this example, we will use the arc_challenge task from nvidia-lm-eval. The nvidia-lm-eval package comes pre-installed with the NeMo Framework Docker image. If you are using a different environment, install the evaluation package:

pip install nvidia-lm-eval==25.6

  1. Deploy your model:

# File deploy.py

from nemo_eval.api import deploy

CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

if __name__ == "__main__":
    deploy(
        nemo_checkpoint=CHECKPOINT_PATH,
        max_input_len=8192,
    )

python deploy.py

The server will return the log-probabilities of tokens if it receives a logprob=<int> parameter in the request. When combined with echo=true, the model will include the input in its response, along with the corresponding log-probabilities.

This process occurs behind the scenes when running an evaluation on arc_challenge.
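
For illustration, a raw request to the completions endpoint might look like the sketch below. This is an assumption-laden example: the payload mirrors an OpenAI-style completions request, with the logprob and echo parameters set as described above, and the exact schema accepted by your deployment may differ:

import requests

# Hypothetical manual request; nvidia-lm-eval sends an equivalent one for you.
# The prompt combines a question with one candidate answer (illustrative text).
response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={
        "model": "megatron_model",
        "prompt": "Question: Which gas do plants absorb? Answer: carbon dioxide",
        "max_tokens": 1,
        "logprob": 1,  # ask the server to return per-token log-probabilities
        "echo": True,  # include the input tokens and their log-probs in the response
    },
)
print(response.json())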

  2. To evaluate your model on the arc_challenge benchmark, use the following code:

Make sure to open a new terminal within the same container to execute it.

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

target_config = EvaluationTarget(
    api_endpoint={
        "url": completions_url,
        "type": "completions",
    }
)
eval_config = EvaluationConfig(
    type="arc_challenge",
    output_dir="/results/",
    params={
        "limit_samples": 10,
        "extra": {
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    },
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Note that in the example above, you must provide a path to the tokenizer:

        "extra": {
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },

This is required to tokenize the input on the client side and to isolate the log-probabilities corresponding specifically to the answer portion of the input.
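
As an illustration of that isolation step, the sketch below counts how many trailing tokens of the combined input belong to the answer. The prompt strings are illustrative, and the actual slicing logic inside nvidia-lm-eval may differ:

from transformers import AutoTokenizer

# Load the checkpoint's tokenizer (the same path passed in "extra" above).
tokenizer = AutoTokenizer.from_pretrained(
    "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer"
)

question = "Question: Which gas do plants absorb? Answer:"
answer = " carbon dioxide"

# Tokenize the question alone and the combined question+answer text.
n_question = len(tokenizer(question)["input_ids"])
n_total = len(tokenizer(question + answer)["input_ids"])

# With echo enabled, the server returns log-probabilities for every input token;
# only the last (n_total - n_question) of them belong to the answer.
n_answer_tokens = n_total - n_question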

This example uses only 10 samples. To evaluate the full dataset, remove the "limit_samples" parameter.