Step #4: Training Using Base Command Platform
This step walks through creating a job and provides sample commands for running training on 1, 2, 4, and 8 GPUs.
This training code is contained in the nv-launchpad-bc:brats-monai-lab container.
brats_training_ddp.py is the script used to train the models. It exposes flags that let you change arguments and experiment with model development. Below are the available flags, followed by sample commands for single- and multi-GPU jobs on Base Command Platform.
--dir sets the dataset directory. Defaults to /mount/workspace/brats2021.
--epochs sets the total number of epochs to run. Defaults to 300.
--lr sets the learning rate. Defaults to 1e-4.
--batch_size sets the batch size. Defaults to 1.
--seed sets the seed for initializing training. Defaults to None.
--cache_rate sets the fraction of the dataset to cache. A larger cache rate needs more GPU memory to store the transformed images. Defaults to 0.1.
--val_interval sets the validation interval. Defaults to every 20 epochs.
--network sets the network architecture, either UNet or SegResNet. Defaults to SegResNet.
--wandb enables Weights and Biases for visualization. Defaults to True.
Play around with these arguments and see how they affect your model speed and accuracy!
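For reference, flags like these are typically defined with Python's argparse module. The sketch below only illustrates that pattern; the names and defaults mirror the list above, but the actual parser code inside brats_training_ddp.py may differ.

# Illustrative sketch only; the real parser in brats_training_ddp.py may differ.
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="BraTS DDP training")
    parser.add_argument("--dir", type=str, default="/mount/workspace/brats2021", help="dataset directory")
    parser.add_argument("--epochs", type=int, default=300, help="total number of epochs to run")
    parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
    parser.add_argument("--batch_size", type=int, default=1, help="batch size per GPU")
    parser.add_argument("--seed", type=int, default=None, help="seed for initializing training")
    parser.add_argument("--cache_rate", type=float, default=0.1, help="fraction of the dataset to cache; larger values need more GPU memory")
    parser.add_argument("--val_interval", type=int, default=20, help="run validation every N epochs")
    parser.add_argument("--network", type=str, default="SegResNet", choices=["UNet", "SegResNet"], help="network architecture")
    parser.add_argument("--wandb", type=lambda s: s.lower() not in ("false", "0"), default=True, help="use Weights and Biases for visualization")
    return parser.parse_args()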
This part of the lab will need to be completed using the desktop that is accessible from the left-hand navigation pane.
Log into NVIDIA NGC by visiting https://ngc.nvidia.com/signin
Expand the Base Command section by clicking the downward-facing chevron and select Dashboard
Click Create Job.
Select your Accelerated Computing Environment (ACE)
Set the instance size to dgxa100.80g.1.norm
Select the Workspaces tab, select the workspace you created in step 1, and set the mount point to /mount/workspace
Set the result mount point to /results
Select the nv-launchpad-bc:brats-monai-lab container from the dropdown and the 1.0 tag
Enter the command below in Run Command to run a 1-GPU job with WandB MLOps visualization integrated
wandb login <api key>; torchrun --nproc_per_node=1 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=10 --cache_rate=0.15
To run without WandB visualization, add the --wandb=False flag:
torchrun --nproc_per_node=1 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=0.15 --wandb=False
Rename the job to "single gpu BraTS training"
Note: Once you have filled in all the required fields, you can copy the command at the bottom of the Create Job page and paste it into your terminal to run the job via the NGC CLI tool.
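For reference, a CLI submission for this job generally looks like the command below. This is only an illustration: the image path, ACE, and workspace values are placeholders, and flag spellings can vary with your org, team, and NGC CLI version, so prefer the command generated on the Create Job page.

ngc batch run --name "single gpu BraTS training" --ace <your ACE> --instance dgxa100.80g.1.norm --image "nv-launchpad-bc/brats-monai-lab:1.0" --workspace <your workspace>:/mount/workspace:RW --result /results --commandline "wandb login <api key>; torchrun --nproc_per_node=1 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=10 --cache_rate=0.15"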
Wait for the job to run to completion; this should take about 37 hours.
If you do not wish to run the job to completion, click the kebab menu to the right of the Results button and then click Kill Job.
While you wait, Base Command can run other jobs concurrently. Try running some of the multi-GPU jobs at the same time!
To run the job on multiple GPUs using PyTorch distributed data-parallel training, follow the single-GPU job creation process, update the torchrun command, and select an instance size with the required number of GPUs.
The following examples optimize GPU utilization by increasing the amount of data cached on each GPU. As the number of GPUs increases, each GPU works on a smaller shard of the dataset while the total GPU memory grows, so a larger fraction of the dataset can be stored in GPU memory (a pattern sketched after the commands below).
Two GPUs
torchrun --nproc_per_node=2 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=0.375
Four GPUs
torchrun --nproc_per_node=4 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=0.75
Eight GPUs
torchrun --nproc_per_node=8 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=1.0
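The cache_rate scaling above follows the way MONAI caching is commonly combined with DDP: the training files are partitioned across ranks and each rank caches only its own shard, which shrinks as GPUs are added, so cache_rate can be raised without exceeding per-GPU memory. The sketch below illustrates that pattern; train_files, train_transforms, and args are placeholders, and the exact code in brats_training_ddp.py may differ.

# Illustrative per-rank caching pattern with MONAI; the lab script may differ.
import torch.distributed as dist
from monai.data import CacheDataset, DataLoader, partition_dataset

dist.init_process_group(backend="nccl")  # torchrun sets the rank/world-size environment variables
rank, world_size = dist.get_rank(), dist.get_world_size()

# Each rank keeps only its shard of the file list, so the same --cache_rate
# covers a larger fraction of that shard as more GPUs are added.
shard = partition_dataset(data=train_files, num_partitions=world_size, shuffle=True, even_divisible=True)[rank]
train_ds = CacheDataset(data=shard, transform=train_transforms, cache_rate=args.cache_rate)
train_loader = DataLoader(train_ds, batch_size=args.batch_size, shuffle=True, num_workers=4)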
As you go through the lab, Weights and Biases is a great tool for analyzing important metrics for your model. You should see metrics similar to those shown below when running the different training commands.
As data gets distributed across multiple GPUs, the training time goes down significantly.
The time it takes to reach a benchmark of 80% mean Dice accuracy also decreases as GPUs are added. It is important to note that the data is not shuffled between epochs; shuffling may help the model reach a given accuracy sooner because more varied data is introduced to the individual GPUs.
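If you want the data reshuffled across GPUs between epochs, the standard PyTorch approach is a DistributedSampler whose epoch counter is advanced on every pass. This is a hypothetical alternative sketch, not what the lab script does by default; train_ds and args are placeholders from the earlier sketch.

# Hypothetical per-epoch reshuffling with a DistributedSampler; not the lab default.
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(train_ds, shuffle=True)  # shards the dataset across ranks
train_loader = DataLoader(train_ds, batch_size=args.batch_size, sampler=sampler)

for epoch in range(args.epochs):
    sampler.set_epoch(epoch)  # re-seeds the shuffle so each GPU sees a different mix every epoch
    for batch in train_loader:
        ...  # training step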
Weights and Biases allows you to run jobs on Base Command and visualize them in real time to see the loss and accuracy of the model. This lets you compare multiple models against each other in real time. The code is already set up to showcase the training and validation loss, as well as the mean accuracy and individual label accuracies. The x-axis is currently set to show accuracy over time; however, it can be changed to plot the y-axis values against other quantities, such as the epoch number.
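Under the hood, this kind of W&B integration generally boils down to a wandb.init call plus periodic wandb.log calls, which is what produces these curves. The sketch below shows that pattern; the project name, metric keys, and helper functions are illustrative placeholders rather than the lab script's exact code, and rank comes from the DDP sketch above.

# Illustrative W&B logging pattern; metric names and helpers here are placeholders.
import wandb

if rank == 0:  # in DDP jobs, logging is usually done from a single rank
    wandb.init(project="brats-monai", config=vars(args))

for epoch in range(args.epochs):
    train_loss = train_one_epoch(...)  # placeholder helper
    if rank == 0:
        wandb.log({"epoch": epoch, "train_loss": train_loss})
    if (epoch + 1) % args.val_interval == 0:
        mean_dice = validate(...)  # placeholder helper
        if rank == 0:
            wandb.log({"epoch": epoch, "val_mean_dice": mean_dice})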
If you opt into using Weights and Biases, it will also generate system analytics such as those shown below. These can help you determine how the model is affecting your GPUs. Some of the graphs below have been customized to use an exponential moving average to visualize the average system usage and utilization over time. Higher GPU utilization may allow your model to train faster; this can potentially be achieved by caching more images in GPU memory or increasing the batch size.