Model Evaluation

NVIDIA provides a simple tool to help evaluate trained checkpoints. You can evaluate the capabilities of a GPT model on the following zero-shot downstream evaluation tasks:

  • lambada

  • boolq

  • race

  • piqa

  • hellaswag

  • winogrande

  • wikitext2

  • wikitext103

You must perform model evaluation using a training checkpoint (.ckpt format), not a converted checkpoint (.nemo format).

You must define the configuration used for the evaluation by setting the evaluation configuration in conf/config.yaml to the evaluation configuration file to be used. Set the configuration to gpt3/evaluate_all, which specifies the configuration file as conf/evaluation/gpt3/evaluate_all.yaml. You can modify the configuration to adapt to different evaluation tasks and checkpoints in evaluation runs. For Base Command Platform, override all of these configurations from the command line.

You must include the evaluation value in stages to run the evaluation pipeline.
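
Taken together, the relevant entries in conf/config.yaml look roughly like this (a minimal sketch; all other entries in the file are left unchanged):

defaults:
  - evaluation: gpt3/evaluate_all   # selects conf/evaluation/gpt3/evaluate_all.yaml

stages:
  - evaluation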

Common

To configure the tasks to be run for evaluation, set the run.tasks configuration. Use the other run configurations to define the job-specific configuration:


run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
  ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
  eval_name: eval_all
  model_train_name: gpt3_5b
  train_dir: ${base_results_dir}/${.model_train_name}
  tasks: all_tasks # supported: lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, wikitext103 OR all_tasks
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

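For example, to evaluate only the LAMBADA task without editing the file, the same setting can typically be overridden from the command line with hydra (a sketch, assuming the gpt3/evaluate_all configuration is already selected in conf/config.yaml):

python3 main.py evaluation.run.tasks=lambada
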
To specify the model checkpoint to load and its definition, use the model configuration:


model:
  model_type: nemo-gpt3
  checkpoint_folder: ${evaluation.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${evaluation.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, 8 for 20b
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  precision: bf16 # must match training precision - 32, 16 or bf16
  eval_batch_size: 4
  vocab_file: ${data_dir}/bpe/vocab.json
  merge_file: ${data_dir}/bpe/merges.txt

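For the gpt3_5b settings above, the interpolations resolve as follows (a sketch of the arithmetic, not literal launcher output):

# model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size = 2 * 1 = 2
# run.nodes           = ceil(model_parallel_size / 8) = ceil(2 / 8) = 1
# run.ntasks_per_node = ceil(model_parallel_size / nodes) = ceil(2 / 1) = 2
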
Slurm

Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:


partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
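
As a hypothetical filled-in example (the partition and account names below are placeholders; substitute your site's Slurm settings):

partition: batch              # hypothetical partition name
account: my_account           # hypothetical Slurm account
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"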

Example

To run only the evaluation pipeline and not the data preparation, training, conversion, or inference pipelines, set the stages section of conf/config.yaml to:


stages:
  - evaluation

Then enter:


python3 main.py

Base Command Platform

To run the evaluation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line using hydra. This script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a 126M GPT model checkpoint stored in /mount/results/gpt3_126m/checkpoints, enter:


python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<evaluation> \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results evaluation.model.vocab_file=/mount/data/bpe/vocab.json \
evaluation.model.merge_file=/mount/data/bpe/merges.txt evaluation.run.results_dir=/mount/results/gpt3_126m/evaluation \
evaluation.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints evaluation.model.eval_batch_size=16 \
evaluation.model.tensor_model_parallel_size=1 \
>> /results/eval_gpt3_log.txt 2>&1

Kubernetes

Set the configuration for a Kubernetes cluster in conf/cluster/k8s.yaml:


pull_secret: null # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev" # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8" # Specify the number of IB devices to include per node in each pod.

Example

Set the cluster and cluster_type settings to k8s in conf/config.yaml.

To run only the evaluation pipeline and not the data preparation, training, conversion, or inference pipelines, set the stages section of conf/config.yaml to:


stages:
  - evaluation

Then enter:


python3 main.py

This launches a Helm chart based on the evaluation configuration, which spawns a pod to evaluate the specified model. The pod can be viewed with kubectl get pods, and the logs can be read with kubectl logs <pod name>.
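
For example, once the job has been launched (use the pod name reported by the first command in the second):

kubectl get pods
kubectl logs <pod name>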

NVIDIA provides a simple tool to help evaluate the prompt-learned GPT checkpoints. You can evaluate the capabilities of a prompt-learned GPT model on a customized prompt learning test dataset.

NVIDIA provides an example that evaluates a checkpoint that went through prompt learning on SQuAD v1.1, using the SQuAD v1.1 test dataset created in the prompt learning step.

You must define the configuration used for the evaluation by setting the evaluation configuration in conf/config.yaml to the evaluation configuration file to be used. Set the configuration to prompt_gpt3/squad, which specifies the configuration file as conf/evaluation/prompt_gpt3/squad.yaml. The configuration can be modified to adapt to different evaluation tasks and checkpoints in evaluation runs. For Base Command Platform, override all of these configurations from the command line.

You must include the evaluation value in stages to run the evaluation pipeline.
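
A minimal sketch of the corresponding conf/config.yaml entries (other entries unchanged):

defaults:
  - evaluation: prompt_gpt3/squad

stages:
  - evaluation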

Common

Set the run.tasks configuration to prompt. Set the other run configurations to define the job-specific configuration:


run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
  ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
  eval_name: eval_prompt_squad
  model_train_name: gpt3_5b
  tasks: "prompt" # general prompt task
  prompt_learn_dir: ${base_results_dir}/${.model_train_name}/prompt_learning_squad # assume prompt learning was on squad task
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

To specify the model checkpoint to be loaded and which prompt learning test dataset to evaluate, set the model configuration:


model:
  model_type: nemo-gpt3-prompt
  nemo_model: ${evaluation.run.prompt_learn_dir}/megatron_gpt_prompt.nemo
  tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, 8 for 20b
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  precision: bf16 # must match training precision - 32, 16 or bf16
  eval_batch_size: 4
  prompt_dataset_paths: ${data_dir}/prompt_data/v1.1/squad_test.jsonl
  disable_special_tokens: False # Whether to disable virtual tokens in prompt model evaluation. This is equivalent to evaluating without prompt-/p-tuning.

Slurm

Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:


partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example

To run only the evaluation pipeline and not the data preparation, training, conversion, or inference pipelines, set the stages section of conf/config.yaml to:


stages:
  - evaluation

Then enter:


python3 main.py

Base Command Platform

To run the evaluation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. This script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a prompt-learned 5B GPT model checkpoint stored in /mount/results/gpt3_5b/checkpoints, enter:


python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<evaluation> evaluation=prompt_gpt3/squad \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/gpt3_5b/eval_prompt_squad \
evaluation.model.nemo_model=/mount/results/gpt3_5b/prompt_learning_squad/results/megatron_gpt_prompt.nemo \
evaluation.model.eval_batch_size=4 evaluation.model.tensor_model_parallel_size=2 \
>> /results/eval_prompt_gpt3_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/eval_prompt_gpt3_log.txt, which you can download from NGC. Any other required configuration may be added to modify the command’s behavior.

NVIDIA provides a simple tool to help evaluate adapter-learned and IA3-learned GPT checkpoints. You can evaluate the capabilities of an adapter-learned GPT model on a customized adapter learning test dataset.

NVIDIA provides an example to evaluate a checkpoint which went through adapter learning or IA3 learning on SQuAD v1.1.

Set the evaluation configuration in conf/config.yaml to specify the evaluation configuration file to be used. For adapter learning, set the configuration to adapter_gpt3/squad, which specifies the evaluation configuration file as conf/evaluation/adapter_gpt3/squad.yaml. For IA3 learning, set the configuration to ia3_gpt3/squad, which specifies the evaluation configuration file as conf/evaluation/ia3_gpt3/squad.yaml.

The evaluation value must be included in stages to run the evaluation pipeline.

The configurations can be modified to adapt to different evaluation tasks and checkpoints in evaluation runs. For Base Command Platform, all configurations must be overridden from the command line.
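
A minimal sketch of the corresponding conf/config.yaml entries (choose one of the two evaluation values; other entries unchanged):

defaults:
  - evaluation: adapter_gpt3/squad   # or: ia3_gpt3/squad

stages:
  - evaluation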

Common

To run evaluation on adapter learning test tasks, set the run.tasks configuration to adapter. Set the other run configurations to define the job-specific configuration:


run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
  ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
  eval_name: eval_adapter_squad # or eval_ia3_squad
  model_train_name: gpt3_5b
  tasks: "adapter" # general adapter task
  adapter_learn_dir: ${base_results_dir}/${.model_train_name}/adapter_learning_squad # or ia3_learning_squad
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

To specify the model checkpoint to be loaded and the adapter learning test dataset to be evaluated, set the model configurations:


data:
  test_ds:
    - ${data_dir}/prompt_data/v1.1/squad_test.jsonl
  num_workers: 4
  global_batch_size: 16
  micro_batch_size: 16
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
pipeline_model_parallel_split_rank: ${divide_floor:${.pipeline_model_parallel_size}, 2}
model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
language_model_path: ${base_results_dir}/${evaluation.run.model_train_name}/convert_nemo/results/megatron_gpt.nemo
adapter_model_file: ${evaluation.run.adapter_learning_dir}/results/megatron_gpt_adapter.nemo # or megatron_gpt_ia3.nemo

Slurm

Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:


partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example

To run only the evaluation pipeline and not the data preparation, training, conversion, or inference pipelines, set the stages section of conf/config.yaml to:


stages:
  - evaluation

Then enter:


python3 main.py

Base Command Platform

To run the evaluation pipeline on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. This script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate an adapter-learned 5B GPT model checkpoint stored in /mount/results/gpt3_5b/adapter_learning_squad, enter:


python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<evaluation> evaluation=adapter_gpt3/squad \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/gpt3_5b/eval_adapter_squad \
evaluation.model.adapter_model_file=/mount/results/gpt3_5b/adapter_learning_squad/results/megatron_gpt3_adapter.nemo \
>> /results/eval_adapter_gpt3_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/eval_adapter_gpt3_log.txt, which you can download from NGC. Any other required configuration may be added to modify the command’s behavior.

To run the evaluation pipeline to evaluate an IA3-learned 5B GPT model checkpoint stored in /mount/results/gpt3_5b/ia3_learning_squad, enter:


python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<evaluation> evaluation=ia3_gpt3/squad \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/gpt3_5b/eval_ia3_squad \
evaluation.model.adapter_model_file=/mount/results/gpt3_5b/ia3_learning_squad/results/megatron_gpt_ia3.nemo \
>> /results/eval_ia3_gpt3_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/eval_ia3_gpt3_log.txt, which you can download from NGC. Any other required configuration may be added to modify the command’s behavior.
