Evaluate a Trained Model#
After training a model, you may want to run evaluation to understand how the model performs on unseen tasks. You can use EleutherAI's Language Model Evaluation Harness to quickly run a variety of popular benchmarks, including MMLU, SuperGLUE, HellaSwag, and WinoGrande. A full list of supported tasks can be found in the LM Evaluation Harness documentation.
Install the LM Evaluation Harness#
Run the following commands inside of a NeMo container to install the LM Evaluation Harness:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
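After installation, you can sanity-check the setup and browse the available benchmarks from the command line. This is a minimal sketch that assumes a recent (v0.4+) release of the harness, where the lm_eval entry point accepts the special list value for --tasks:
# Print every task and task group supported by the installed harness version (assumes lm-eval v0.4+)
lm_eval --tasks list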
Run Evaluations#
A detailed description of running evaluation with .nemo models can be found in EleutherAI's documentation. Single- and multi-GPU evaluation is supported. The following is an example of running evaluation using 8 GPUs on the lambada_openai, super-glue-lm-eval-v1, and winogrande tasks using a .nemo file from NeMo-Aligner.
Note that unzipping your .nemo file before running evaluations is recommended, but not required.
mkdir unzipped_checkpoint
tar -xvf /path/to/model.nemo -C unzipped_checkpoint
torchrun --nproc-per-node=8 --no-python lm_eval --model nemo_lm \
--model_args path='unzipped_checkpoint',devices=8,tensor_model_parallel_size=8 \
--tasks lambada_openai,super-glue-lm-eval-v1,winogrande \
--batch_size 8
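For single-GPU evaluation, torchrun is not needed and the harness can be invoked directly. The following is a minimal sketch that reuses the unzipped checkpoint from above and assumes it fits on a single GPU; the hellaswag task is only an illustrative choice:
# Single-GPU evaluation on one task, reusing the unzipped checkpoint from above
lm_eval --model nemo_lm \
--model_args path='unzipped_checkpoint' \
--tasks hellaswag \
--batch_size 8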