Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Evaluate NeMo 2.0 Checkpoints#

This guide provides detailed instructions on evaluating NeMo 2.0 checkpoints using the integrated lm-evaluation-harness within the NeMo Framework. Supported benchmarks include MMLU, GSM8k, lambada_openai, winogrande, arc_challenge, arc_easy, and copa.

Introduction#

The evaluation process uses a server-client approach with two phases. In Phase 1, the NeMo 2.0 checkpoint is exported to TensorRT-LLM (TRT-LLM) and deployed on a PyTriton server. In Phase 2, the evaluation runs against the deployed model using the server's URL and port.

Run Evaluations without NeMo-Run#

This section outlines the steps to deploy and evaluate a NeMo 2.0 model directly from Python, without using NeMo-Run. This method is quick to set up and easy to debug, which makes it well suited to evaluation on a local workstation with GPUs. For running evaluations on clusters, NeMo-Run is recommended for its ease of use.

The entry point for deployment is the deploy method defined in nemo/collections/llm/api.py. Below is an example script for deployment:

from nemo.collections.llm import deploy

if __name__ == "__main__":
    deploy(
        nemo_checkpoint='/workspace/hf_llama3_8b_nemo2.nemo',
        max_input_len=4096,
        max_batch_size=4,
        num_gpus=1,
    )

The entry point for evaluation is the evaluate method defined in nemo/collections/llm/api.py. To run evaluations on the deployed model, use the following script. Run it from a new terminal within the same container as the deployment. For longer evaluations, it is advisable to run both the deploy and evaluate scripts in tmux sessions so that a dropped terminal connection does not kill the processes and abort the runs.

from nemo.collections.llm import evaluate
from nemo.collections.llm.evaluation.api import EvaluationConfig, ApiEndpoint, EvaluationTarget, ConfigParams

nemo_checkpoint = '/workspace/hf_llama3_8b_nemo2.nemo/'
api_endpoint = ApiEndpoint(nemo_checkpoint_path=nemo_checkpoint)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(top_p=1, temperature=1, top_k=1, limit_samples=2, num_fewshot=5)
eval_config = EvaluationConfig(type='mmlu', params=eval_params)

if __name__ == "__main__":
    evaluate(target_cfg=eval_target, eval_cfg=eval_config)

Note

The examples above are sample scripts and do not show every argument or its default value. For the full list of argument options, refer to the deploy and evaluate methods in nemo/collections/llm/api.py. For more details on the arguments of the ApiEndpoint and ConfigParams classes used for evaluation, refer to nemo/collections/llm/evaluation/api.py.
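To run several benchmarks against the same deployed model, the evaluate call can be wrapped in a loop. The following is a minimal sketch, not part of NeMo: `validate_benchmarks` and `evaluate_benchmarks` are hypothetical helpers, and the lowercase benchmark identifiers are an assumption based on the benchmark list at the top of this guide (only `'mmlu'` appears in the example above). The sampling and few-shot values are copied from that example and may not suit every benchmark.

```python
# Benchmark names taken from the list in this guide; the exact strings
# accepted by EvaluationConfig(type=...) are an assumption except for 'mmlu'.
SUPPORTED_BENCHMARKS = (
    "mmlu",
    "gsm8k",
    "lambada_openai",
    "winogrande",
    "arc_challenge",
    "arc_easy",
    "copa",
)


def validate_benchmarks(names):
    """Normalize benchmark names and reject any that are not in the list above."""
    normalized = [name.strip().lower() for name in names]
    unknown = [name for name in normalized if name not in SUPPORTED_BENCHMARKS]
    if unknown:
        raise ValueError(f"Unsupported benchmark(s): {', '.join(unknown)}")
    return normalized


def evaluate_benchmarks(nemo_checkpoint, names, limit_samples=2, num_fewshot=5):
    """Run each requested benchmark in turn against an already deployed model."""
    # Imported lazily so the validation helper can be used without NeMo installed.
    from nemo.collections.llm import evaluate
    from nemo.collections.llm.evaluation.api import (
        ApiEndpoint,
        ConfigParams,
        EvaluationConfig,
        EvaluationTarget,
    )

    api_endpoint = ApiEndpoint(nemo_checkpoint_path=nemo_checkpoint)
    eval_target = EvaluationTarget(api_endpoint=api_endpoint)
    for name in validate_benchmarks(names):
        eval_params = ConfigParams(
            top_p=1,
            temperature=1,
            top_k=1,
            limit_samples=limit_samples,
            num_fewshot=num_fewshot,
        )
        eval_config = EvaluationConfig(type=name, params=eval_params)
        evaluate(target_cfg=eval_target, eval_cfg=eval_config)
```

Validating the names before the loop starts means a typo is reported immediately instead of after earlier benchmarks have already run.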

Run Evaluations with NeMo-Run#

This section explains how to run evaluations with NeMo-Run. For detailed information about NeMo-Run, please refer to its documentation. Below is a concise guide focused on using NeMo-Run to perform evaluations in NeMo 2.0.

Launch Evaluations with NeMo-Run#

The evaluation.py script serves as a reference for launching evaluations with NeMo-Run. It demonstrates how to use NeMo-Run with both local executors (your local workstation) and Slurm-based executors (clusters). In this setup, the deploy and evaluate processes are launched as two separate jobs with NeMo-Run. The evaluate method waits until the PyTriton server is accessible and the model is deployed before starting the evaluations.
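The waiting behavior can be illustrated with a simple polling loop. This is a sketch of the general technique using only the Python standard library, not the logic NeMo itself uses: `wait_for_server` is a hypothetical helper, and the default host and port (`localhost:8000`) are assumptions that should be replaced with the URL and port of your deployment.

```python
import socket
import time


def wait_for_server(host="localhost", port=8000, timeout=600.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: give up or retry after a pause.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that the server process is listening. It does not prove that the model has finished loading, so a production check would typically also query a readiness endpoint of the server.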

Run Locally with NeMo-Run#

To run evaluations on your local workstation, use the following command:

python scripts/llm/evaluation.py --nemo_checkpoint='/workspace/hf_llama3_8b_nemo2.nemo'

Note

When running locally with NeMo-Run, you will need to manually terminate the deploy process once evaluations are complete.
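If you script a local workflow yourself and start the deploy step as a child process, one way to avoid a lingering server is to wrap that process in a context manager that stops it when evaluation ends. The sketch below uses only the Python standard library. `background_process` is a hypothetical helper, and the command you pass to it is whatever you use to start the deploy step. It does not affect processes that NeMo-Run has launched on its own.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def background_process(cmd, grace_seconds=10.0):
    """Start cmd in the background and make sure it is stopped on exit."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        # Ask the process to exit, then force-kill it if it ignores the request.
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
```

For example, `with background_process(["python", "deploy_script.py"]):` (where `deploy_script.py` is a placeholder for a file containing the deploy example) keeps the server alive for the body of the `with` block and stops it afterwards, even if the evaluation raises an exception.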

Run on Slurm-based Clusters#

To run evaluations on Slurm-based clusters, add the --slurm flag to your command and specify any custom parameters such as user, host, remote_job_dir, account, and mounts. Refer to the evaluation.py script for further details. Below is an example command:

python scripts/llm/evaluation.py --nemo_checkpoint='/workspace/hf_llama3_8b_nemo2.nemo' --slurm --nodes 1 \
    --devices 8 --container_image "nvcr.io/nvidia/nemo:25.02" --tensor_parallelism_size 8

With these commands, you can run evaluations using NeMo-Run both on a local workstation and on Slurm-based clusters.