Run Evaluations with NeMo Run#

This tutorial explains how to run evaluations inside the NeMo Framework container with NeMo Run. For detailed information about NeMo Run, refer to its documentation. Below is a concise guide focused on using NeMo Run to perform evaluations in NeMo.

Prerequisites#

  • Docker installed

  • NeMo Framework container

  • Access to a NeMo 2.0 checkpoint (tutorials use Llama 3.2 1B Instruct)

  • CUDA-compatible GPU with sufficient memory (for running locally) or access to a Slurm-based cluster (for running remotely)

  • NeMo Evaluator repository cloned (for access to scripts)

    git clone https://github.com/NVIDIA-NeMo/Evaluator.git
    
  • (Optional) Your Hugging Face token if you are using gated datasets (e.g., the GPQA-Diamond dataset)

How it works#

The evaluation_with_nemo_run.py script serves as a reference for launching evaluations with NeMo Run. It demonstrates how to use NeMo Run with both local executors (your workstation) and Slurm-based executors (clusters). In this setup, the deploy and evaluate processes are launched as two separate NeMo Run jobs. The evaluate job waits until the PyTriton server is accessible and the model is deployed before starting the evaluations.

For this purpose we define a helper function:

def wait_and_evaluate(target_cfg, eval_cfg):
    server_ready = check_endpoint(
        endpoint_url=target_cfg.api_endpoint.url,
        endpoint_type=target_cfg.api_endpoint.type,
        model_name=target_cfg.api_endpoint.model_id,
    )
    if not server_ready:
        raise RuntimeError(
            "Server is not ready to accept requests. Check the deployment logs for errors."
        )
    return evaluate(target_cfg=target_cfg, eval_cfg=eval_cfg)
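
For intuition, this readiness check is essentially a polling loop against the served endpoint. The sketch below is an illustrative stand-in, not the actual check_endpoint implementation; the retry count, interval, and use of a plain HTTP GET are assumptions:

import time

import requests  # assumption: a plain HTTP poll stands in for the real readiness check


def poll_until_ready(endpoint_url: str, max_retries: int = 60, interval_s: float = 10.0) -> bool:
    """Return True once the server answers, False if it never comes up."""
    for _ in range(max_retries):
        try:
            # Any non-5xx response means the server is up and routing requests.
            if requests.get(endpoint_url, timeout=5).status_code < 500:
                return True
        except requests.RequestException:
            pass  # server not reachable yet
        time.sleep(interval_s)
    return False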


The script supports two types of serving: with Triton (the default) and with Ray (pass the --serving_backend ray flag). User-provided arguments are mapped onto the flags expected by the scripts.

The script supports two modes of running the experiment (a rough sketch of the argument parsing follows the list below):

  • locally, using your environment

  • remotely, sending the job to a Slurm-based cluster
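
As a rough sketch of how those arguments could be parsed (flag names come from the examples in this tutorial; defaults and choice strings are illustrative assumptions, so consult the script for the authoritative list):

import argparse

parser = argparse.ArgumentParser(description="Deploy a NeMo checkpoint and evaluate it with NeMo Run")
parser.add_argument("--nemo_checkpoint", required=True, help="Path to the NeMo 2.0 checkpoint")
parser.add_argument("--eval_task", default="gsm8k", help="Evaluation benchmark to run")
parser.add_argument("--serving_backend", default="pytriton", choices=["pytriton", "ray"])  # exact strings may differ
parser.add_argument("--slurm", action="store_true", help="Submit to a Slurm cluster instead of running locally")
parser.add_argument("--devices", type=int, default=1, help="GPUs per node")
parser.add_argument("--nodes", type=int, default=1)
parser.add_argument("--dryrun", action="store_true", help="Configure the experiment without launching it")
args = parser.parse_args()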

First, an executor is selected based on the arguments provided by the user, either a local one:

    env_vars = {
        # required for some eval benchmarks from lm-eval-harness
        "HF_DATASETS_TRUST_REMOTE_CODE": "1",
        "HF_TOKEN": "xxxxxx",
    }

    executor = run.LocalExecutor(env_vars=env_vars)
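
The LocalExecutor runs the tasks as processes on the machine you launch the script from, reusing the environment variables defined above, so the command must be executed from inside the NeMo Framework container.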

or a Slurm one:

    env_vars = {
        # required for some eval benchmarks from lm-eval-harness
        "HF_DATASETS_TRUST_REMOTE_CODE": "1",
        "HF_TOKEN": "xxxxxx",
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    packager = run.Config(run.GitArchivePackager, subpath="scripts")

    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir,
        ),
        nodes=nodes,
        ntasks_per_node=devices,
        exclusive=True,
        # archives and uses the local code. Use packager=run.Packager() to use the code mounted on the cluster
        packager=packager,
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time
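
When submitting to Slurm, the container_mounts list typically needs to include the directory holding your NeMo checkpoint, as well as a location for the /results/ output directory, so that both are visible inside the container.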

Note

Make sure to replace the placeholder HF_TOKEN value with your Hugging Face token.
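
As a small optional tweak (not part of the reference script), you can read the token from your shell environment instead of editing the file:

import os

env_vars = {
    "HF_DATASETS_TRUST_REMOTE_CODE": "1",
    # Pull the token from the calling shell so it never lands in the source file.
    "HF_TOKEN": os.environ.get("HF_TOKEN", ""),
}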

Then, the two jobs are configured:

    deploy_run_script = run.Script(inline=deploy_script)

    api_endpoint = run.Config(
        ApiEndpoint,
        url=f"http://{args.server_address}:{args.server_port}/v1/{ENDPOINT_TYPES[args.endpoint_type]}",
        type=args.endpoint_type,
        model_id="megatron_model",
    )
    eval_target = run.Config(EvaluationTarget, api_endpoint=api_endpoint)
    eval_params = run.Config(
        ConfigParams,
        limit_samples=args.limit,
        parallelism=args.parallel_requests,
        request_timeout=args.request_timeout,
    )
    eval_config = run.Config(
        EvaluationConfig,
        type=args.eval_task,
        params=eval_params,
        output_dir="/results/",
    )

    eval_fn = run.Partial(
        wait_and_evaluate, target_cfg=eval_target, eval_cfg=eval_config
    )
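
Here ENDPOINT_TYPES maps the --endpoint_type argument to the matching OpenAI-compatible route in the endpoint URL. The mapping below is illustrative only; the exact keys in evaluation_with_nemo_run.py may differ:

# Illustrative mapping; check the script for the exact keys it accepts.
ENDPOINT_TYPES = {
    "chat": "chat/completions",
    "completions": "completions",
}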

Finally, the experiment is started:

    with run.Experiment(f"{exp_name}{args.tag}") as exp:
        if args.slurm:
            exp.add(
                [deploy_run_script, eval_fn],
                executor=[executor, executor_eval],
                name=exp_name,
                tail_logs=False,
            )
        else:
            exp.add(
                deploy_run_script,
                executor=executor,
                name=f"{exp_name}_deploy",
                tail_logs=True,
            )
            exp.add(
                eval_fn, executor=executor, name=f"{exp_name}_evaluate", tail_logs=True
            )

        if args.dryrun:
            exp.dryrun()
        else:
            exp.run()
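
Note the difference between the two branches: on Slurm, the deploy script and the evaluation function are added together in a single exp.add call with their own executors, so NeMo Run schedules them as one experiment on the cluster; locally, they are added as two separate tasks and the logs of both are tailed in your terminal.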

Run Locally#

To run evaluations on your local workstation, use the following command:

cd Evaluator/scripts
python evaluation_with_nemo_run.py \
  --nemo_checkpoint '/workspace/llama3_8b_nemo2/' \
  --eval_task 'gsm8k' \
  --devices 2

Note

When running locally with NeMo Run, you will need to manually terminate the deploy process once evaluations are complete.

Run on Slurm-based Clusters#

To run evaluations on Slurm-based clusters, add the --slurm flag to your command and specify any custom parameters such as user, host, remote_job_dir, account, mounts, etc. Refer to the evaluation_with_nemo_run.py script for further details. Below is an example command:

cd Evaluator/scripts
python evaluation_with_nemo_run.py \
  --nemo_checkpoint='/workspace/llama3_8b_nemo2' \
  --slurm --nodes 1 \
  --devices 8 \
  --container_image "nvcr.io/nvidia/nemo:25.11" \
  --tensor_parallelism_size 8
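
To inspect the experiment configuration without actually launching the deploy and evaluate jobs, append --dryrun to either of the commands above.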