Run Evaluation Using a Task Without a Pre-Defined Config#
Introduction#
NVIDIA Eval Factory packages provide a unified interface and a set of pre-defined task configurations for launching evaluations.
However, you can also evaluate your model on a task that is not included in this set. To do so, specify the task as "<harness name>.<task name>", where the task name comes from the underlying evaluation harness. For example, nvidia-lm-eval is a wrapper around lm-evaluation-harness, so any task defined in that harness can be referenced by name.
Note that when launching custom tasks, the default settings may not be optimal: you must provide the recommended configuration yourself (for example, few-shot settings) and determine which endpoint type is appropriate for the task. Refer to "Deploy and Evaluate NeMo Checkpoints" for more information on the different endpoint types.
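For example, a custom lm-evaluation-harness task can be referenced like this (a minimal sketch; hellaswag is only an illustrative task name, and the EvaluationConfig class is shown in full in the example below):

# Sketch only: reference a harness task as "<harness name>.<task name>".
# "hellaswag" is an illustrative lm-evaluation-harness task name; supply any
# recommended settings (e.g., few-shot count) yourself via params.
from nemo_eval.utils.api import EvaluationConfig

custom_eval_config = EvaluationConfig(
    type="lm-evaluation-harness.hellaswag",
    output_dir="/results/",
)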
Evaluate a NeMo Checkpoint with lambada_openai#
In this example, we will use the lambada_openai task from nvidia-lm-eval. The nvidia-lm-eval package comes pre-installed with the NeMo Framework Docker image.
If you are using a different environment, install the evaluation package:
pip install nvidia-lm-eval==25.6
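To confirm that the package is visible from your Python environment, you can optionally check the installed version (a quick standard-library check):

# Optional: verify that nvidia-lm-eval is installed and print its version
import importlib.metadata

print(importlib.metadata.version("nvidia-lm-eval"))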
Deploy your model:
# File deploy.py

from nemo_eval.api import deploy

# Path to the NeMo checkpoint that will be served for evaluation
CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

if __name__ == "__main__":
    # Start an inference server for the checkpoint; max_input_len caps the
    # input sequence length accepted by the endpoint
    deploy(
        nemo_checkpoint=CHECKPOINT_PATH,
        max_input_len=8192,
    )
python deploy.py
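Optionally, verify that the endpoint responds before starting the evaluation. The sketch below assumes the server exposes an OpenAI-compatible completions route at the URL used later in this example and that the requests library is available in your environment:

# Optional readiness check (sketch): send a tiny completions request to the
# deployed model; a JSON response indicates the server is ready.
import requests

response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={"model": "megatron_model", "prompt": "Hello,", "max_tokens": 4},
    timeout=120,
)
print(response.status_code)
print(response.json())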
Configure and run the evaluation:
Be sure to launch a new terminal inside the same container before running the evaluation, since the deployment started in the previous step continues running in the first terminal.
from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

# Name under which the checkpoint is served and the completions endpoint
# exposed by deploy.py
model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

# Point the evaluation at the deployed completions endpoint
target_config = EvaluationTarget(
    api_endpoint={
        "url": completions_url,
        "type": "completions",
    }
)

# "<harness name>.<task name>": the lambada_openai task from lm-evaluation-harness
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    output_dir="/results/",
    params={
        "limit_samples": 10,  # evaluate only 10 samples; remove for a full run
        "extra": {
            # Tokenizer shipped with the checkpoint, loaded via Hugging Face
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    },
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)
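When the run finishes, the results are written under the configured output_dir ("/results/" here), and the object returned by evaluate() can be inspected directly, for example:

# Inspect the results object returned by evaluate()
print(results)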
Note that lambada_openai uses log-probabilities for evaluation. To learn more about this approach, see "Evaluate LLMs Using Log-Probabilities". This example uses only 10 samples; to evaluate the full dataset, remove the "limit_samples" parameter, as in the sketch below.
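For reference, a full-dataset run reuses the configuration from above without the sample limit (all other settings stay the same):

# Full-dataset run: identical to the config above, minus "limit_samples"
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    output_dir="/results/",
    params={
        "extra": {
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    },
)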