Important

NeMo 2.0 is an experimental feature and is currently available only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Model Evaluation

NVIDIA provides a simple tool to help evaluate trained checkpoints. You can evaluate the capabilities of the Qwen2 model on the following zero-shot downstream evaluation tasks:

lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, wikitext103

Fine-tuned Qwen2 models can be evaluated on the following tasks:

squad

Run Evaluation

To run evaluation, update conf/config.yaml:

defaults:
  - evaluation: qwen2/evaluate_all.yaml

stages:
  - evaluation

Execute the launcher pipeline: python3 main.py.

Configure Settings

You can find the default evaluation configuration in conf/evaluation/qwen2/evaluate_all.yaml.

To configure the run settings:

run:
    name: ${.eval_name}_${.model_train_name}
    time_limit: "4:00:00"
    nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
    ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
    eval_name: eval_all
    model_train_name: qwen2_7b
    train_dir: ${base_results_dir}/${.model_train_name}
    tasks: all_tasks
    results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

tasks sets the evaluation task to execute. Supported tasks include: lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, wikitext103, all_tasks. all_tasks executes all supported evaluation tasks.
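The nodes and ntasks_per_node expressions in the run block derive the job layout from evaluation.model.model_parallel_size. The following is a minimal sketch of that arithmetic, assuming the divide_ceil resolver performs ceiling division; the function names here are illustrative and this is not the launcher's own implementation:

```python
GPUS_PER_NODE = 8  # taken from the "# 8 gpus per node" comment in the run block


def divide_ceil(a: int, b: int) -> int:
    """Ceiling division; assumed to mirror the launcher's divide_ceil resolver."""
    return -(-a // b)


def run_layout(model_parallel_size: int) -> tuple[int, int]:
    """Return (nodes, ntasks_per_node) for a given model parallel size."""
    # nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8}
    nodes = divide_ceil(model_parallel_size, GPUS_PER_NODE)
    # ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
    ntasks_per_node = divide_ceil(model_parallel_size, nodes)
    return nodes, ntasks_per_node


if __name__ == "__main__":
    for size in (1, 8, 16):
        nodes, ntasks = run_layout(size)
        print(f"model_parallel_size={size}: nodes={nodes}, ntasks_per_node={ntasks}")
```

Under these assumptions, the default configuration (model parallel size 1) resolves to one node running one task, and a model parallel size of 16 resolves to two nodes with eight tasks each.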

model:
    model_type: nemo-qwen2
    nemo_model: null # specify path to .nemo file, produced when converting interleaved checkpoints
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 1
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    precision: bf16 # must match training precision - 32, 16 or bf16
    eval_batch_size: 4

nemo_model sets the path to the .nemo checkpoint to evaluate.
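Before launching, it can help to sanity-check the model block. The sketch below is a hypothetical pre-flight helper (not part of the launcher) that assumes the settings have already been loaded into a plain dictionary. It checks that nemo_model is set, that precision is one of the values listed in the config comment, and computes model_parallel_size the way the multiply expression suggests:

```python
ALLOWED_PRECISIONS = {"32", "16", "bf16"}  # values listed in the precision comment


def check_model_settings(model: dict) -> int:
    """Validate a model block and return the resulting model_parallel_size.

    Hypothetical helper for illustration; the launcher does not provide it.
    """
    nemo_model = model.get("nemo_model")
    if not nemo_model:
        raise ValueError("nemo_model is null; set it to the path of a .nemo file")
    if not str(nemo_model).endswith(".nemo"):
        raise ValueError(f"expected a .nemo file, got: {nemo_model}")
    if str(model.get("precision")) not in ALLOWED_PRECISIONS:
        raise ValueError(f"unsupported precision: {model.get('precision')}")
    # model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    tp = int(model["tensor_model_parallel_size"])
    pp = int(model["pipeline_model_parallel_size"])
    return tp * pp


if __name__ == "__main__":
    settings = {
        "nemo_model": "/path/to/qwen2_7b.nemo",  # placeholder path, not a real location
        "tensor_model_parallel_size": 2,
        "pipeline_model_parallel_size": 1,
        "precision": "bf16",
    }
    print(check_model_settings(settings))
```

The helper cannot confirm that precision matches the precision used in training; that still has to be checked against the training configuration.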