Evaluate LLMs Using Log-Probabilities#

This guide demonstrates how to evaluate a Large Language Model using log-probabilities. It provides a complete, practical example of running this evaluation on a NeMo checkpoint with the arc_challenge task.

The instructions in this example apply to all nvidia-lm-eval tasks that use log-probabilities:

  • arc_challenge

  • arc_multilingual

  • bbh

  • commonsense_qa

  • hellaswag

  • hellaswag_multilingual

  • musr

  • openbookqa

  • piqa

  • social_iqa

  • truthfulqa

  • winogrande

Note that all benchmarks that use log-probabilities require a “completions” endpoint. For more information on the different endpoint types, refer to “Evaluate Checkpoints Trained by NeMo Framework”.

Introduction#

While the most typical approach to LLM evaluation involves assessing the quality of a model’s generated response to a question, an alternative method uses log-probabilities.

In this approach, we quantify a model’s “surprise” or uncertainty when processing a text sequence. This is done by summing the log-probabilities that the model assigns to each token in the sequence. A higher sum indicates that the model is more confident about the sequence.

In this evaluation approach:

  • The LLM is given a single combined text containing both the question and a potential answer.

  • Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer.

  • This allows an assessment of how likely it is that the model would provide that answer for the given question.

For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model.
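
To make this selection rule concrete, here is a minimal sketch of multiple-choice scoring. The per-choice log-probabilities below are hypothetical placeholders; in practice the evaluation harness obtains them from the completions endpoint.

# Hypothetical per-token log-probabilities for the answer portion of each choice
# (the question/context tokens are excluded from the sum).
choice_token_logprobs = {
    "Answer A": [-2.1, -0.7, -1.3],
    "Answer B": [-0.4, -0.2, -0.9],
    "Answer C": [-3.0, -1.8, -2.2],
}

# Score each choice by summing the log-probabilities of its answer tokens.
scores = {choice: sum(logprobs) for choice, logprobs in choice_token_logprobs.items()}

# The choice with the highest (least negative) sum is treated as the model's answer.
selected = max(scores, key=scores.get)
print(selected)  # "Answer B" in this toy example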

The sum of log-probabilities can be used to calculate different metrics, such as perplexity. Additionally, log-probabilities can be analyzed to assess whether the model would produce a given response under greedy sampling, a check commonly used to measure accuracy.
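
As an illustration of these metrics, the sketch below derives perplexity from the answer-token log-probabilities and performs a simple greedy check. The token lists are hypothetical; a real harness reads them from the endpoint’s response.

import math

# Hypothetical log-probabilities of the answer tokens under the model.
answer_logprobs = [-0.4, -0.2, -0.9]

# Perplexity is the exponential of the negative mean log-probability.
perplexity = math.exp(-sum(answer_logprobs) / len(answer_logprobs))

# Greedy check: the answer would be produced by greedy sampling only if every
# answer token is also the most likely token at its position.
answer_tokens = ["the", "blue", "whale"]
top_tokens = ["the", "blue", "whale"]  # hypothetical argmax token at each position
is_greedy = answer_tokens == top_tokens

print(f"perplexity={perplexity:.3f}, is_greedy={is_greedy}")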

Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format.

Evaluate a NeMo Checkpoint with arc_challenge#

In this example, we will use the arc_challenge task from nvidia-lm-eval. The nvidia-lm-eval package comes pre-installed with the NeMo Framework Docker image. If you are using a different environment, install the evaluation package:

pip install nvidia-lm-eval

  1. Deploy your model:

# File deploy.py

from nemo_eval.api import deploy

CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

if __name__ == "__main__":
    deploy(
        nemo_checkpoint=CHECKPOINT_PATH,
        max_input_len=8192,
    )

python deploy.py

You can verify that the server is ready to accept requests with the following function:

from nemo_eval.utils.base import check_endpoint

check_endpoint(
    endpoint_url="http://0.0.0.0:8080/v1/completions/",
    endpoint_type="completions",
    model_name="megatron_model",
)

The server returns per-token log-probabilities if the request includes a logprobs=<int> parameter. When combined with echo=true, the response also includes the input tokens along with their corresponding log-probabilities.

This process occurs behind the scenes when running an evaluation on arc_challenge.
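
For illustration, the snippet below sketches what such a request could look like when sent directly to the completions endpoint. It assumes an OpenAI-style completions API (prompt, logprobs, echo); the exact request and response schema of your deployment may differ.

import requests

# A minimal sketch of a completions request asking for log-probabilities.
# The URL and model name match the deployment above; the payload fields are
# assumptions based on the OpenAI-style completions format.
payload = {
    "model": "megatron_model",
    "prompt": "Question: What color is the sky?\nAnswer: blue",
    "max_tokens": 1,
    "logprobs": 1,   # return per-token log-probabilities
    "echo": True,    # include the input tokens and their log-probabilities in the response
    "temperature": 0,
}

response = requests.post("http://0.0.0.0:8080/v1/completions/", json=payload)
response.raise_for_status()
print(response.json())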

  2. To evaluate your model on the arc_challenge benchmark, use the following code:

Make sure to open a new terminal within the same container to execute it.

from nvidia_eval_commons.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)
from nvidia_eval_commons.core.evaluate import evaluate

model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id=model_name)
)
eval_config = EvaluationConfig(
    type="arc_challenge",
    output_dir="/results/",
    params=ConfigParams(
        limit_samples=10,
        temperature=0,
        top_p=0,
        parallelism=1,
        extra={
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    ),
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Note that in the example above, you must provide a path to the tokenizer:

        "extra": {
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },

This is required to tokenize the input on the client side and to isolate the log-probabilities corresponding specifically to the answer portion of the input.
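
The sketch below illustrates the idea: by tokenizing the context alone and the context plus answer, the client can determine which trailing token positions belong to the answer and sum only their log-probabilities. This is a simplified illustration using the Hugging Face tokenizers API, not the exact implementation inside nvidia-lm-eval, and it ignores tokenization boundary effects.

from transformers import AutoTokenizer

# Same tokenizer path as in the evaluation config above.
tokenizer = AutoTokenizer.from_pretrained(
    "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer"
)

context = "Question: What color is the sky?\nAnswer:"
answer = " blue"

# Tokenize the context alone and the full context-plus-answer text.
context_ids = tokenizer(context, add_special_tokens=False)["input_ids"]
full_ids = tokenizer(context + answer, add_special_tokens=False)["input_ids"]

# The answer occupies the trailing positions beyond the context.
answer_start = len(context_ids)
answer_token_ids = full_ids[answer_start:]

# Only the log-probabilities returned for these positions are summed when
# scoring this answer.
print(f"answer spans token positions {answer_start}..{len(full_ids) - 1}")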

This example uses only 10 samples. To evaluate the full dataset, remove the "limit_samples" parameter.