Evaluate LLMs Using Log-Probabilities#
This guide demonstrates how to evaluate a Large Language Model using log-probabilities.
It provides a complete, practical example of running this evaluation on a NeMo checkpoint with the arc_challenge task.
The instructions provided in this example apply to all nvidia-lm-eval tasks utilizing log-probabilities:
arc_challenge
arc_multilingual
bbh
commonsense_qa
hellaswag
hellaswag_multilingual
musr
openbookqa
piqa
social_iqa
truthfulqa
winogrande
Note that all benchmarks that use log-probabilities require a “completions” endpoint. For more information on the different endpoint types, refer to “Evaluate Checkpoints Trained by NeMo Framework”.
Introduction#
While the most typical approach to LLM evaluation involves assessing the quality of a model’s generated response to a question, an alternative method uses log-probabilities.
In this approach, we quantify a model’s “surprise”, or uncertainty, when processing a text sequence. This is done by summing the log-probabilities that the model assigns to each token. A higher sum indicates that the model is more confident about the sequence.
In this evaluation approach:
The LLM is given a single combined text containing both the question and a potential answer.
Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer.
This allows an assessment of how likely it is that the model would provide that answer for the given question.
For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model.
The sum of log-probabilities can also be used to calculate other metrics, such as perplexity. Alternatively, the log-probabilities can be analyzed to determine whether the model would produce the answer with greedy sampling; this approach is used to calculate accuracy.
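To make the selection step concrete, here is a toy sketch of this scoring logic. The per-token log-probability values below are invented for illustration and are not produced by any real model or by nvidia-lm-eval:
import math

# Hypothetical log-probabilities assigned to the tokens of each candidate answer
# (values are made up for illustration).
candidate_answer_logprobs = {
    "carbon dioxide": [-0.2, -0.4],
    "oxygen": [-2.1, -1.8],
    "nitrogen": [-3.0, -2.5],
}

# Sum the log-probabilities of the answer tokens; the candidate with the highest
# sum is treated as the answer selected by the model.
scores = {answer: sum(lps) for answer, lps in candidate_answer_logprobs.items()}
selected = max(scores, key=scores.get)
print(selected)  # carbon dioxide

# The same sums can also be turned into a per-answer perplexity.
perplexities = {
    answer: math.exp(-sum(lps) / len(lps))
    for answer, lps in candidate_answer_logprobs.items()
}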
Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format.
Evaluate a NeMo Checkpoint with arc_challenge#
In this example, we will use the arc_challenge task from nvidia-lm-eval. The nvidia-lm-eval package comes pre-installed with the NeMo Framework Docker image. If you are using a different environment, install the evaluation package:
pip install nvidia-lm-eval==25.6
Deploy your model:
# File deploy.py

from nemo_eval.api import deploy

CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

if __name__ == "__main__":
    deploy(
        nemo_checkpoint=CHECKPOINT_PATH,
        max_input_len=8192,
    )
python deploy.py
The server will return the log-probabilities of tokens if it receives a logprob=<int> parameter in the request. When combined with echo=true, the model will include the input in its response, along with the corresponding log-probabilities. This process occurs behind the scenes when running an evaluation on arc_challenge.
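For illustration only, the snippet below sketches what such a request can look like against the completions endpoint used later in this guide. Only the logprob and echo parameters are taken from this guide; the prompt, the remaining fields, and the exact payload shape are assumptions and may differ for your deployment:
import requests

# Illustrative request only; this is not the evaluation harness's implementation.
response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={
        "model": "megatron_model",
        "prompt": "Question: Which gas do plants absorb from the air? Answer: carbon dioxide",
        "max_tokens": 1,
        "logprob": 1,  # return per-token log-probabilities
        "echo": True,  # echo the input tokens and their log-probabilities in the response
    },
)
print(response.json())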
To evaluate your model on the arc_challenge benchmark, open a new terminal within the same container and run the following code:
from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

# Point the evaluation at the completions endpoint exposed by the deployed model.
target_config = EvaluationTarget(
    api_endpoint={
        "url": completions_url,
        "type": "completions",
    }
)

# Configure the arc_challenge task; "limit_samples" restricts the run to 10 samples.
eval_config = EvaluationConfig(
    type="arc_challenge",
    output_dir="/results/",
    params={
        "limit_samples": 10,
        "extra": {
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    },
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)
print(results)
Note that in the example above, you must provide a path to the tokenizer:
"extra": {
"tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
"tokenizer_backend": "huggingface",
},
This is required to tokenize the input on the client side and to isolate the log-probabilities corresponding specifically to the answer portion of the input.
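As a rough illustration of why the client-side tokenizer matters (this is not the actual nvidia-lm-eval implementation, and the question and answer strings are made up), tokenizing the question alone tells you where the answer tokens start:
from transformers import AutoTokenizer

# Load the same tokenizer that the deployed checkpoint uses.
tokenizer = AutoTokenizer.from_pretrained(
    "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer"
)

question = "Question: Which gas do plants absorb from the air?\nAnswer:"
answer = " carbon dioxide"

# The difference in token counts locates the answer tokens, so the
# log-probabilities returned with echo=true can be summed over the answer only.
num_question_tokens = len(tokenizer(question)["input_ids"])
num_total_tokens = len(tokenizer(question + answer)["input_ids"])
answer_token_positions = range(num_question_tokens, num_total_tokens)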
This example uses only 10 samples. To evaluate the full dataset, remove the "limit_samples" parameter.