Step #3: Training Using NGC Batch Run

Train a Large-Scale NLP Model with NeMo Megatron (Latest Release)

The following is the complete NGC batch command for training the NeMo Megatron model; its components are explained in the sections that follow.


ngc batch run \
    --name "bignlp_training_126m_bf16" \
    --org nv-launchpad-bc \
    --team no-team \
    --ace nv-launchpad-bc-iad1 \
    --instance dgxa100.80g.8.norm \
    --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
    --result /results \
    --workspace jdoe_workspace:/mount_workspace:RW \
    --replicas 4 \
    --array-type PYTORCH \
    --total-runtime 4D \
    --commandline "\
        set -x && \
        python3 /opt/bignlp/bignlp-scripts/main.py \
        cluster_type=bcp \
        training=gpt3/126m \
        training_config=gpt3/126m \
        run_training=True \
        run_data_preparation=False \
        run_conversion=False \
        run_evaluation=False \
        bignlp_path=/opt/bignlp/bignlp-scripts \
        data_dir=/mount_workspace/data/the_pile \
        base_results_dir=/mount_workspace/results \
        training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
        training.model.tokenizer.vocab_file=/mount_workspace/data/bpe/vocab.json \
        training.model.tokenizer.merge_file=/mount_workspace/data/bpe/merges.txt \
        > >(tee -a /results/train_log.log) \
        2> >(tee -a /results/train_stderr.log >&2) && \
        rsync -P -rvh /mount_workspace/results /results"

The workspace is mounted in the job using --workspace jdoe_workspace:/mount_workspace:RW. The dataset files are large, and depending on the size of the dataset, it might not be practical to store the data in the workspace. Large datasets could optionally be stored in a dataset object on BCP and symlinked to the workspace, but for the purpose of this lab, all data resides in the workspace for simplicity.
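As a rough illustration of the dataset-object approach, a read-only dataset mount could be linked into the path the training scripts expect inside the workspace. The helper and paths below are hypothetical, not part of the lab scripts; in this lab all data lives directly in the workspace.

```shell
# Hypothetical sketch: expose a read-only BCP dataset mount at the workspace
# path the training configuration expects. All paths here are illustrative.
link_dataset() {
    src="$1"   # dataset mountpoint inside the job (e.g. /mount/data/the_pile)
    dst="$2"   # path the config expects (e.g. /mount_workspace/data/the_pile)
    mkdir -p "$(dirname "$dst")"   # ensure the parent directory exists
    ln -sfn "$src" "$dst"          # create or replace the symlink
}
```

For example, `link_dataset /mount/data/the_pile /mount_workspace/data/the_pile` would run inside the job before launching main.py.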

Several GPT-3 model configurations are provided in the NeMo Megatron scripts, including 126m, 5B, 20B, and larger. These configurations include carefully selected hyperparameters that should serve as a guideline for any custom model configurations. All of these configurations reside in the conf/training/gpt3/ directory of the bignlp code (located at /opt/bignlp/bignlp-scripts inside the bignlp training container).

General training time requirements are approximately one and a half days for the 126m model on eight nodes, six days for the 5B model on 20 nodes, and seven days for the 20B model on 80 nodes, where each node consists of eight A100 GPUs (such as a DGX A100 system). Due to LaunchPad resource limitations and time constraints, the instructions below train the 126m model; the same steps apply when training larger models.

The 126m model training by default uses eight nodes with eight GPUs per node and BF16 precision. The model includes 12 transformer layers, a hidden size of 768, and 12 attention heads. The sequence length is 2048, and the optimizer is Adam. This model does not use any model parallelism.
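The hyperparameters above map to entries in the 126m training configuration. The fragment below is an abridged sketch of what such a config contains, not the verbatim file; field names follow common NeMo conventions, and the actual file under conf/training/gpt3/ should be treated as authoritative.

```yaml
# Abridged, illustrative sketch of a 126m-style training config.
trainer:
  num_nodes: 8
  devices: 8            # GPUs per node
  precision: bf16
model:
  num_layers: 12
  hidden_size: 768
  num_attention_heads: 12
  encoder_seq_length: 2048
  tensor_model_parallel_size: 1   # no model parallelism
  optim:
    name: adam
```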

Note

Training with four nodes takes around four days.
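The four-node estimate is roughly consistent with linear scaling of node-days (1.5 days x 8 nodes / 4 nodes = 3 days ideal, with the extra time reflecting sublinear scaling and overheads). The helper below is a hypothetical back-of-the-envelope calculator, not part of the lab scripts.

```shell
# Sketch: ideal linear-scaling runtime estimate when changing node count.
# args: baseline_days baseline_nodes target_nodes
scaled_days() {
    awk -v d="$1" -v n0="$2" -v n1="$3" 'BEGIN { printf "%.1f\n", d * n0 / n1 }'
}
```

For example, `scaled_days 1.5 8 4` prints 3.0, the ideal-case estimate for four nodes.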

The training portion of the command can be found below.


python3 /opt/bignlp/bignlp-scripts/main.py \
    cluster_type=bcp \
    training=gpt3/126m \
    training_config=gpt3/126m \
    run_training=True \
    run_data_preparation=False \
    run_conversion=False \
    run_evaluation=False \
    bignlp_path=/opt/bignlp/bignlp-scripts \
    data_dir=/mount_workspace/data/the_pile \
    base_results_dir=/mount_workspace/results \
    training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount_workspace/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount_workspace/data/bpe/merges.txt \
    > >(tee -a /results/train_log.log) \
    2> >(tee -a /results/train_stderr.log >&2)

While training is running, checkpoints are saved in the /mount_workspace/results/gpt3_126m/checkpoints subdirectory. The latest checkpoint file has a name ending in *-last.ckpt. The training log can be found at /mount_workspace/results/gpt3_126m/gpt3_126m.log.
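To check progress from an interactive shell, a small helper like the following can print the newest checkpoint. This is a sketch assuming the workspace layout above, not part of the lab scripts.

```shell
# Sketch: print the most recently written *-last.ckpt in a checkpoints directory.
latest_ckpt() {
    # Sort newest first by modification time; stay quiet if none exist yet.
    ls -t "$1"/*-last.ckpt 2>/dev/null | head -n 1
}
```

For example, `latest_ckpt /mount_workspace/results/gpt3_126m/checkpoints` prints the path of the checkpoint to download or convert.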

Once training is complete, the final trained model can optionally be uploaded and shared via the NGC model registry, using either the web UI or the CLI. Note that the remainder of the lab does not rely on models being saved to the NGC model registry.

  1. (Optional) Download the trained model from the results or workspace with the command below.


    # download final trained model
    ngc result download \
        --org nv-launchpad-bc \
        --team no-team \
        --ace nv-launchpad-bc-iad1 \
        --file /gpt3_126m/checkpoints/*-last.ckpt \
        <training-job-id>

    # or from workspace
    ngc workspace download \
        --org nv-launchpad-bc \
        --team no-team \
        --ace nv-launchpad-bc-iad1 \
        --file results/gpt3_126m/checkpoints/<full filename of *-last.ckpt file> \
        jdoe_workspace


  2. (Optional) To upload the downloaded model to the NGC model registry, refer to the model registry instructions.

If training ends prematurely, it is possible to resume from saved checkpoints. Training might be interrupted by hardware failure, software issues such as BCP downtime, or a timeout while fetching data.

Since the training checkpoints are saved in the workspace, simply restart training with the same command shown at the beginning of this step.

As before, trained checkpoints will be saved in results/gpt3_126m/checkpoints inside the workspace.
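Before resubmitting, it can be reassuring to confirm that checkpoints actually exist in the workspace, since the resumed run picks up from the saved state. A minimal check, assuming the same workspace layout (this helper is illustrative, not part of the lab scripts):

```shell
# Sketch: succeed only if at least one .ckpt file exists in the given directory.
has_checkpoints() {
    ls "$1"/*.ckpt >/dev/null 2>&1
}
```

For example, `has_checkpoints /mount_workspace/results/gpt3_126m/checkpoints && echo "resuming from saved state"` before relaunching the job.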

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.