Model Evaluation

NVIDIA provides a tool for evaluating trained checkpoints. You can evaluate the capabilities of the Nemotron model on the following zero-shot downstream evaluation tasks: lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, and wikitext103.

You can also evaluate fine-tuned Nemotron models on the SQuAD task.

Run Evaluation

To run evaluation, update conf/config.yaml:

defaults:
  - evaluation: nemotron/evaluate_all.yaml

stages:
  - evaluation

Execute the launcher pipeline: python3 main.py

Configure Evaluation

You can find default configurations for evaluation in conf/evaluation/nemotron/evaluate_all.yaml.

To configure evaluation, set the following parameters in the run section:

run:
    name: ${.eval_name}_${.model_train_name}
    time_limit: "4:00:00"
    nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
    ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
    eval_name: eval_all
    model_train_name: nemotron
    train_dir: ${base_results_dir}/${.model_train_name}
    tasks: all_tasks
    results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

The tasks parameter sets the evaluation task to execute. Supported values are lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, wikitext103, and all_tasks. The all_tasks value executes all supported evaluation tasks.
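As a quick illustration of how the tasks value behaves, the following Python sketch checks a value against the supported list and expands all_tasks. It is not the launcher's own code; the names SUPPORTED_TASKS and expand_tasks are hypothetical and exist only for this example.

```python
# Hypothetical pre-flight check for the `tasks` value; not part of the launcher.
SUPPORTED_TASKS = (
    "lambada", "boolq", "race", "piqa",
    "hellaswag", "winogrande", "wikitext2", "wikitext103",
)


def expand_tasks(tasks: str) -> list:
    """Return the individual tasks that a `tasks` value stands for."""
    if tasks == "all_tasks":
        # all_tasks is shorthand for every supported evaluation task.
        return list(SUPPORTED_TASKS)
    if tasks not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported evaluation task: {tasks!r}")
    return [tasks]


if __name__ == "__main__":
    print(expand_tasks("piqa"))
    print(len(expand_tasks("all_tasks")))
```

A check like this can catch a misspelled task name before a multi-node job is submitted.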

Set the appropriate model parallel sizes. For Nemotron 340B, use the following values:

model:
    model_type: nemo-nemotron
    nemo_model: null # path to the .nemo file produced by converting interleaved checkpoints
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    precision: bf16 # must match training precision - 32, 16 or bf16
    eval_batch_size: 4

The nemo_model parameter sets the path to the .nemo checkpoint to evaluate.
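The nodes and ntasks_per_node values in the run section are derived from the model parallel sizes. The following Python sketch reproduces that arithmetic for the Nemotron 340B values. It assumes that the multiply and divide_ceil resolvers perform plain multiplication and ceiling division, as their names suggest; the function evaluation_layout is hypothetical and is not part of the launcher.

```python
def divide_ceil(a: int, b: int) -> int:
    """Ceiling division (assumed behavior of the divide_ceil resolver)."""
    return -(-a // b)


def evaluation_layout(tensor_parallel: int, pipeline_parallel: int,
                      gpus_per_node: int = 8) -> dict:
    """Mirror the interpolations used for model_parallel_size, nodes, and ntasks_per_node."""
    model_parallel_size = tensor_parallel * pipeline_parallel
    nodes = divide_ceil(model_parallel_size, gpus_per_node)
    ntasks_per_node = divide_ceil(model_parallel_size, nodes)
    return {
        "model_parallel_size": model_parallel_size,
        "nodes": nodes,
        "ntasks_per_node": ntasks_per_node,
    }


if __name__ == "__main__":
    # Tensor parallel 8 and pipeline parallel 2, as configured for Nemotron 340B.
    print(evaluation_layout(8, 2))
    # {'model_parallel_size': 16, 'nodes': 2, 'ntasks_per_node': 8}
```

Under these assumptions, the 340B settings require 16 GPUs spread over 2 nodes with 8 tasks per node.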

Run Evaluation on PEFT Nemotron Models

To run evaluation on PEFT Nemotron models, update conf/config.yaml:

defaults:
  - evaluation: peft_nemotron/squad.yaml

stages:
  - evaluation

Execute the launcher pipeline: python3 main.py

Configure Evaluation

You can find default configurations for PEFT Nemotron evaluation in conf/evaluation/peft_nemotron/squad.yaml.

To configure evaluation, set the following parameters in the run section:

run:
  name: eval_${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  convert_name: convert_nemo
  model_train_name: nemotron
  task_name: "squad"  # SQuAD v1.1
  convert_dir: ${base_results_dir}/${.model_train_name}/${.convert_name}
  fine_tuning_dir: ${base_results_dir}/${.model_train_name}/peft_${.task_name}
  results_dir: ${base_results_dir}/${.model_train_name}/peft_${.task_name}_eval

Set PEFT-specific configurations:

peft:
  peft_scheme: "ptuning"  # can be adapter, ia3, or ptuning
  restore_from_path: ${evaluation.run.fine_tuning_dir}/${.peft_scheme}/megatron_nemotron_peft_tuning-${.peft_scheme}/checkpoints/megatron_nemotron_peft_tuning-${.peft_scheme}.nemo

The peft_scheme parameter sets the scheme used during fine-tuning.

The restore_from_path parameter specifies the path to the PEFT checkpoint for evaluation.
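To see what restore_from_path expands to, the following Python sketch assembles the same path by hand. It assumes that every peft_scheme reference in the path resolves to the chosen scheme and that fine_tuning_dir follows the run section shown earlier. The function peft_checkpoint_path and the /results base directory are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath

# Schemes listed in the configuration comment.
PEFT_SCHEMES = ("adapter", "ia3", "ptuning")


def peft_checkpoint_path(base_results_dir: str, peft_scheme: str,
                         model_train_name: str = "nemotron",
                         task_name: str = "squad") -> str:
    """Build the PEFT checkpoint path the way the interpolations would resolve it."""
    if peft_scheme not in PEFT_SCHEMES:
        raise ValueError(f"Unknown PEFT scheme: {peft_scheme!r}")
    # Equivalent of run.fine_tuning_dir.
    fine_tuning_dir = PurePosixPath(base_results_dir) / model_train_name / f"peft_{task_name}"
    run_name = f"megatron_nemotron_peft_tuning-{peft_scheme}"
    return str(fine_tuning_dir / peft_scheme / run_name / "checkpoints" / f"{run_name}.nemo")


if __name__ == "__main__":
    # "/results" is a placeholder for base_results_dir.
    print(peft_checkpoint_path("/results", "ptuning"))
```

Comparing the printed path with the files on disk is a simple way to confirm that peft_scheme matches the scheme used during fine-tuning.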