Step #5: Evaluate the BigNLP Model

The BigNLP scripts include an evaluation harness, a simple tool for evaluating trained checkpoints. You can measure the zero-shot capabilities of the GPT-3 model on the following downstream evaluation tasks: boolq, hellaswag, lambada, race, piqa, winogrande, wikitext103, and wikitext2.

  1. Use the NGC batch command below to evaluate the trained model.


    ngc batch run \
      --name "bignlp_eval_all_126m_bf16" \
      --org nv-launchpad-bc \
      --team no-team \
      --ace nv-launchpad-bc-iad1 \
      --instance dgxa100.80g.8.norm \
      --replicas 2 \
      --array-type PYTORCH \
      --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
      --result /results \
      --workspace jdoe_workspace:/mount_workspace:RW \
      --total-runtime 1h \
      --commandline "\
        set -x && \
        python3 /opt/bignlp/bignlp-scripts/main.py \
          run_data_preparation=False \
          run_training=False \
          run_conversion=False \
          run_evaluation=True \
          cluster_type=bcp \
          bignlp_path=/opt/bignlp/bignlp-scripts \
          base_results_dir=/mount_workspace/results \
          evaluation.model.vocab_file=/mount_workspace/data/bpe/vocab.json \
          evaluation.model.merge_file=/mount_workspace/data/bpe/merges.txt \
          evaluation.run.results_dir=/mount_workspace/results/gpt3_126m/eval_all \
          evaluation.run.train_dir=/mount_workspace/results/gpt3_126m \
          evaluation.model.eval_batch_size=16 \
          evaluation.model.checkpoint_name=megatron_gpt*last.ckpt \
          evaluation.model.tensor_model_parallel_size=1 \
          > >(tee -a /results/eval_bf16_log.txt) 2>&1 && \
        find /mount_workspace/results/gpt3_126m/eval_all -name metrics.json -exec cp {} /results/metrics.json \\;"
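
    You can monitor the job while it runs and confirm that it finished successfully. A minimal sketch using the NGC CLI (replace <job-id> with the job ID printed by ngc batch run):

    # Show the status and details of the evaluation job.
    ngc batch info <job-id>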


  2. Locate the results file from the NGC batch command.

    The results are saved in a metrics.json file. The file is also copied to the workspace, so you can download it either from the workspace or from the /results directory of the job.
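
    For example, you can retrieve the job results with the NGC CLI. A minimal sketch (replace <job-id> with the ID of the evaluation job):

    # Download the job's /results directory, which includes metrics.json
    # and eval_bf16_log.txt.
    ngc result download <job-id>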

    The preceding ngc batch command produced the following metrics.json:


    { "boolq": { "acc": 0.6048929663608563, "acc_stderr": 0.008550454248280895 }, "hellaswag": { "acc": 0.28211511651065524, "acc_stderr": 0.004491093528113424, "acc_norm": 0.2969527982473611, "acc_norm_stderr": 0.004559817589182073 }, "lambada": { "ppl": 23.718865098431262, "ppl_stderr": 0.8686846521536599, "acc": 0.4075295944110227, "acc_stderr": 0.0068458121817245796 }, "piqa": { "acc": 0.6115342763873776, "acc_stderr": 0.011371877593210252, "acc_norm": 0.6109902067464635, "acc_norm_stderr": 0.011374774974447464 }, "race": { "acc": 0.2727272727272727, "acc_stderr": 0.013783600792343722, "acc_norm": 0.35789473684210527, "acc_norm_stderr": 0.014836467904073716 }, "winogrande": { "acc": 0.5059194948697711, "acc_stderr": 0.014051500838485807 }, "wikitext103": { "word_perplexity": 31.284774148752945, "byte_perplexity": 1.903858863780367, "bits_per_byte": 0.64388280745592 }, "wikitext2": { "word_perplexity": 31.284772851165563, "byte_perplexity": 1.9038588490133912, "bits_per_byte": 0.6438827996995804 } }


  3. Compare the results in the metrics.json file to the values in the following table.

    Each metric should be within 2% of the corresponding value in the table; the sketch after the table shows one way to automate the check.

    [Table: expected evaluation metrics for the GPT-3 126M model (base-command-056.png)]
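
    One way to automate the comparison is sketched below. It assumes you first create a reference.json (a hypothetical file, not produced by the job) that mirrors the structure of metrics.json and holds the expected values from the table:

    # Print every metric that deviates from its reference value by more than 2%.
    # Requires jq; note that "length" applied to a number yields its absolute value.
    jq -nr --slurpfile got metrics.json --slurpfile ref reference.json '
      $ref[0] as $r | $got[0] as $g
      | $r | paths(numbers) as $p
      | ($r | getpath($p)) as $expected
      | ($g | getpath($p)) as $actual
      | select(((($actual - $expected) / $expected) | length) > 0.02)
      | "\($p | join(".")): got \($actual), expected \($expected)"'

    If the command prints nothing, every metric is within the 2% tolerance.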


© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.