Tasks Not Explicitly Defined by Framework Definition File
Introduction
NeMo Evaluator provides a unified interface and a curated set of pre-defined task configurations for launching evaluations. These task configurations are specified in the Framework Definition File (FDF) to provide a simple and standardized way of running evaluations, with minimal user input required.
However, you can also evaluate your model on a task that is not explicitly included in the FDF.
To do so, specify the task as "<harness name>.<task name>", where the task name comes from the underlying evaluation harness, and ensure that all task parameters (for example, sampling parameters and few-shot settings) are set correctly.
Additionally, you must determine which endpoint type is appropriate for the task.
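For example, the PolEmo 2.0 task used below is addressed with the following string; the task_type name is only illustrative, and the string itself is passed as the type field of EvaluationConfig in the example that follows:

# "<harness name>.<task name>", e.g., the PolEmo 2.0 task from LM Evaluation Harness:
task_type = "lm-evaluation-harness.polemo2"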
Run Evaluation
In this example, we will use the PolEmo 2.0 task from LM Evaluation Harness. This task consists of consumer reviews in Polish and assesses sentiment analysis abilities. It requires a “completions” endpoint, and its sampling parameters are defined as part of the task configuration in the underlying harness.
Note
Make sure to review the task configuration in the underlying harness and ensure that the sampling parameters are defined and match your preferred way of running the benchmark.
You can configure the evaluation using the params field of EvaluationConfig, as shown at the end of this page.
1. Prepare the Environment
Start the lm-evaluation-harness Docker container:
docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10
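If you want the evaluation results to persist outside the container, you can mount a host directory over the /results path that is used as output_dir later on this page. The command below is a minimal sketch assuming the results should land in a results/ folder in your current directory; adapt the host path to your setup:

docker run --rm -it \
    -v "$(pwd)/results:/results" \
    nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10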
Alternatively, install the nemo-evaluator and nvidia-lm-eval Python packages in your environment of choice:
pip install nemo-evaluator nvidia-lm-eval
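To confirm that the packages are installed correctly, you can check that the evaluate entry point imports cleanly; this is only a quick sanity check, not a required step:

python -c "from nemo_evaluator.api import evaluate; print('nemo-evaluator is ready')"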
2. Run the Evaluation
from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

# Model served behind an OpenAI-compatible completions endpoint.
model_name = "meta-llama/Llama-3.1-8B"
completions_url = "http://0.0.0.0:8000/v1/completions/"

# The PolEmo 2.0 task requires a completions endpoint.
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url=completions_url,
        type=EndpointType.COMPLETIONS,
        model_id=model_name,
    )
)

# Tasks not defined in the FDF are addressed as "<harness name>.<task name>".
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.polemo2",
    output_dir="/results/",
    # params={  # pass params to adjust how the benchmark is run
    #     "temperature": 0,
    #     "top_p": 0,
    #     "max_new_tokens": 50,
    # },
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)
print(results)
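To override the sampling defaults that the harness defines for the task, uncomment the params block and set the values explicitly. The sketch below reuses the values from the commented example above; which keys a given task honors depends on its configuration in the underlying harness:

eval_config = EvaluationConfig(
    type="lm-evaluation-harness.polemo2",
    output_dir="/results/",
    params={
        "temperature": 0,
        "top_p": 0,
        "max_new_tokens": 50,
    },
)

The evaluation results are written under the directory given in output_dir.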