Run Evaluation Using a Task Without a Pre-Defined Config#
Introduction#
NVIDIA Eval Factory packages provide a unified interface and a set of pre-defined task configurations for launching evaluations.
However, you can also evaluate your model on a task that is not included in this set. To do so, specify the task as "<harness name>.<task name>", where the task name comes from the underlying evaluation harness. For example, nvidia-lm-eval is a wrapper around lm-evaluation-harness, so any task defined in that harness can be referenced by name.
Note that when launching custom tasks, the default settings may not be optimal: you must provide the recommended configuration yourself (for example, few-shot settings) and determine which endpoint type is appropriate for the task. Refer to "Deploy and Evaluate NeMo Checkpoints" for more information on the different endpoint types.
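For example, a custom lm-evaluation-harness task can be referenced like this (a minimal sketch; hellaswag is only an illustrative task name, and the EvaluationConfig class is shown in full in the example below):

# Sketch only: reference a harness task as "<harness name>.<task name>".
# "hellaswag" is an illustrative lm-evaluation-harness task name; supply any
# recommended settings (e.g., few-shot count) yourself via params.
from nemo_eval.utils.api import EvaluationConfig

custom_eval_config = EvaluationConfig(
    type="lm-evaluation-harness.hellaswag",
    output_dir="/results/",
)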
Evaluate a NeMo Checkpoint with lambada_openai#
In this example, we will use the lambada_openai task from nvidia-lm-eval. The nvidia-lm-eval package comes pre-installed with the NeMo Framework Docker image.
If you are using a different environment, install the evaluation package:
pip install nvidia-lm-eval==25.6
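To confirm that the package is visible from your Python environment, you can optionally check the installed version (a quick standard-library check):

# Optional: verify that nvidia-lm-eval is installed and print its version
import importlib.metadata

print(importlib.metadata.version("nvidia-lm-eval"))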
Deploy your model:
# File deploy.py

from nemo_eval.api import deploy

# Path to the NeMo checkpoint that will be served for evaluation
CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

if __name__ == "__main__":
    # Start an inference server for the checkpoint; max_input_len caps the
    # input sequence length accepted by the endpoint
    deploy(
        nemo_checkpoint=CHECKPOINT_PATH,
        max_input_len=8192,
    )
python deploy.py
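Optionally, verify that the endpoint responds before starting the evaluation. The sketch below assumes the server exposes an OpenAI-compatible completions route at the URL used later in this example and that the requests library is available in your environment:

# Optional readiness check (sketch): send a tiny completions request to the
# deployed model; a JSON response indicates the server is ready.
import requests

response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={"model": "megatron_model", "prompt": "Hello,", "max_tokens": 4},
    timeout=120,
)
print(response.status_code)
print(response.json())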
Configure and run the evaluation:
Be sure to launch a new terminal inside the same container before running the evaluation, since the deployment started in the previous step continues running in the first terminal.
from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

# Name under which the checkpoint is served and the completions endpoint
# exposed by deploy.py
model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

# Point the evaluation at the deployed completions endpoint
target_config = EvaluationTarget(
    api_endpoint={
        "url": completions_url,
        "type": "completions",
    }
)

# "<harness name>.<task name>": the lambada_openai task from lm-evaluation-harness
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    output_dir="/results/",
    params={
        "limit_samples": 10,  # evaluate only 10 samples; remove for a full run
        "extra": {
            # Tokenizer shipped with the checkpoint, loaded via Hugging Face
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    },
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)
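When the run finishes, the results are written under the configured output_dir ("/results/" here), and the object returned by evaluate() can be inspected directly, for example:

# Inspect the results object returned by evaluate()
print(results)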
Note that lambada_openai uses log-probabilities for evaluation. To learn more about this approach, see "Evaluate LLMs Using Log-Probabilities". This example uses only 10 samples; to evaluate the full dataset, remove the "limit_samples" parameter, as in the sketch below.
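For reference, a full-dataset run reuses the configuration from above without the sample limit (all other settings stay the same):

# Full-dataset run: identical to the config above, minus "limit_samples"
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    output_dir="/results/",
    params={
        "extra": {
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    },
)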