Step #4: Training Using Base Command Platform

This step walks through creating a job and provides sample commands for running training on 1, 2, 4, and 8 GPU systems.

This training code is contained in the nv-launchpad-bc:brats-monai-lab container.

brats_training_ddp.py is the script used to train the models. Command-line flags are included to let you change arguments and experiment with the model development. Below are some sample commands for single- and multi-GPU training jobs on Base Command Platform.

--dir sets the dataset directory. This defaults to /mount/workspace/brats2021.

--epochs sets the total number of epochs to run. This defaults to 300 epochs.

--lr sets the learning rate. This defaults to 1e-4.

--batch_size sets the batch size. This defaults to a batch size of 1.

--seed sets the seed for initializing training. This defaults to None.

--cache_rate sets the fraction of the dataset to cache. A larger cache rate needs more GPU memory to store the transformed images. This defaults to 0.1.

--val_interval sets the validation interval. This defaults to every 20 epochs.

--network sets the model architecture and can be either UNet or SegResNet. This defaults to SegResNet.

--wandb enables Weights and Biases for visualization. This defaults to True.
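
For reference, the sketch below shows how flags like these might be defined with Python's argparse, using the defaults listed above. It is a reconstruction for illustration only, not the actual contents of brats_training_ddp.py, which may structure its argument handling differently.

    # Hypothetical sketch of an argument parser matching the flags above;
    # the real brats_training_ddp.py may differ.
    import argparse

    def get_args():
        parser = argparse.ArgumentParser(description="BraTS DDP training (sketch)")
        str2bool = lambda s: str(s).lower() in ("true", "1", "yes")  # handles --wandb=False
        parser.add_argument("--dir", default="/mount/workspace/brats2021", help="dataset directory")
        parser.add_argument("--epochs", type=int, default=300, help="total number of epochs to run")
        parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
        parser.add_argument("--batch_size", type=int, default=1, help="batch size")
        parser.add_argument("--seed", type=int, default=None, help="seed for initializing training")
        parser.add_argument("--cache_rate", type=float, default=0.1, help="fraction of the dataset to cache")
        parser.add_argument("--val_interval", type=int, default=20, help="run validation every N epochs")
        parser.add_argument("--network", default="SegResNet", choices=["UNet", "SegResNet"], help="model architecture")
        parser.add_argument("--wandb", type=str2bool, default=True, help="log metrics to Weights and Biases")
        return parser.parse_args()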

Note

Play around with these arguments and see how they affect your model speed and accuracy!

Important

This part of the lab will need to be completed using the desktop that is accessible from the left-hand navigation pane.

  1. Log into NVIDIA NGC by visiting https://ngc.nvidia.com/signin

  2. Expand the Base Command section by clicking the downward-facing chevron and select Dashboard.

    base-command-007.png

  3. Click Create Job.

    base-command-011.png

  4. Select your Accelerated Computing Environment (ACE)

    base-command-012.png

  5. Set the instance size to dgxa100.80g.1.norm

    base-command-013.png

  6. Select the Workspaces tab, select the workspace you created in step 1, and set the mount point to /mount/workspace

    step-02-image-03.png

  7. Set the result mountpoint to /results

    base-command-015.png

  8. Select the nv-launchpad-bc:brats-monai-lab container from the dropdown and the 1.0 tag

    step-02-image-04.png

  9. Enter the command below in Run Command to run a 1 GPU job with WandB MLOps visualization integrated

    wandb login <api key>; torchrun --nproc_per_node=1 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=10 --cache_rate=0.15

    • To run without WandB visualization, add the --wandb=False flag

    torchrun --nproc_per_node=1 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=0.15 --wandb=False


  10. Rename the job to single gpu BraTS training

    step-04-image-03.png
    Note

    Once you have filled in all the required fields, you can copy the command at the bottom of the Create Job page and paste it into your terminal to run the job via the NGC CLI tool.


  11. Wait for the job to run to completion; this should take about 37 hours.

    • If you do not wish to run the job to completion, click the kebab menu on the right of the Results button and then click Kill Job.

    step-04-image-06.png


Note

While you wait, Base Command can run other jobs concurrently. Try running some of the multi-GPU jobs at the same time!

To run the job on multiple GPUs using PyTorch distributed data-parallel training, follow the single-GPU job creation process, update the torchrun command as shown below, and select an instance size with the required number of GPUs.
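
For context, torchrun starts one process per GPU and sets environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE that the training script reads to initialize distributed training. The sketch below shows the typical PyTorch DDP setup such a script performs; it is an assumed outline, not the exact code in brats_training_ddp.py.

    # Typical DDP initialization when launched with torchrun (assumed sketch).
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(model: torch.nn.Module) -> DDP:
        dist.init_process_group(backend="nccl")     # one process per GPU, NCCL for CUDA
        local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
        torch.cuda.set_device(local_rank)
        model = model.cuda(local_rank)
        # DDP averages gradients across all processes after each backward pass.
        return DDP(model, device_ids=[local_rank])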

The following examples optimize GPU utilization by increasing the amount of data cached on each GPU. As the number of GPUs increases, the additional GPU memory allows more of the dataset to be stored in GPU memory.

Two GPUs

torchrun --nproc_per_node=2 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=0.375

Four GPUs

torchrun --nproc_per_node=4 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=0.75

Eight GPUs

torchrun --nproc_per_node=8 --nnodes=1 /workspace/brats-monai-curated-lab/brats_training_ddp.py --epochs=300 --val_interval=5 --cache_rate=1.0
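
The --cache_rate values above follow the idea that cached, already-transformed images do not need to be re-loaded and re-transformed every epoch. The sketch below illustrates this with MONAI's CacheDataset; it is an assumption for illustration, and the dataset class and transforms actually used in brats_training_ddp.py may differ. Whether the cache lives in host or GPU memory depends on the transforms applied before caching.

    # Illustrative dataset caching with MONAI (assumed; not the lab's exact code).
    from monai.data import CacheDataset, DataLoader
    from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, EnsureTyped

    transforms = Compose([
        LoadImaged(keys=["image", "label"]),
        EnsureChannelFirstd(keys=["image", "label"]),
        EnsureTyped(keys=["image", "label"]),  # with device="cuda" the cache can live in GPU memory
    ])

    def build_loader(data_dicts, cache_rate, batch_size=1):
        # cache_rate is the fraction of items whose transform results are kept in memory.
        dataset = CacheDataset(data=data_dicts, transform=transforms, cache_rate=cache_rate)
        return DataLoader(dataset, batch_size=batch_size, num_workers=4)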

As you go through the lab, Weights and Biases is a great tool for analyzing important metrics for your model. You should see metrics similar to those shown below when running the different training commands.

As data gets distributed across multiple GPUs, the training time goes down significantly.

step-04-image-01.png

The time it takes to reach a benchmark of 80% mean Dice accuracy also decreases as the number of GPUs increases. It is important to note that the data is not shuffled between epochs. Data shuffling may help the model reach higher accuracy sooner, since more of the dataset is introduced to each individual GPU over time.

step-04-image-02.png
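
If you want to experiment with shuffling between epochs in a distributed job, the standard PyTorch pattern is a DistributedSampler whose epoch counter is advanced each epoch. The sketch below shows that generic pattern; it is not necessarily how brats_training_ddp.py is written.

    # Generic per-epoch shuffling pattern under DDP (assumed, not the lab script).
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def train_with_shuffling(train_dataset, num_epochs, batch_size=1):
        sampler = DistributedSampler(train_dataset, shuffle=True)  # shards the data per rank
        loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
        for epoch in range(num_epochs):
            sampler.set_epoch(epoch)  # re-seeds the shuffle so each epoch uses a new ordering
            for batch in loader:
                ...  # forward/backward/optimizer step as usual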

Weights and Biases allows you to run jobs on Base Command and visualize them in real time, showing the loss and accuracy of the model and letting you compare multiple models against each other as they train. The code is already set up to report the training and validation loss, as well as the mean accuracy and the individual label accuracies. The x-axis is currently set to show accuracy over time; however, it can be changed to other values, such as the epoch number, for comparison against the metrics on the y-axis.

step-04-image-05.png
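
The charts above come from metrics logged to Weights and Biases from inside the training loop. The sketch below shows what that kind of logging typically looks like; the project name and metric keys are illustrative assumptions, not necessarily those used by brats_training_ddp.py.

    # Minimal Weights and Biases logging sketch (names are illustrative assumptions).
    import wandb

    # In a DDP job, typically only rank 0 initializes and logs to Weights and Biases.
    wandb.init(project="brats-monai-lab", config={"epochs": 300, "lr": 1e-4, "batch_size": 1})

    def log_epoch(epoch, train_loss, val_loss=None, mean_dice=None):
        metrics = {"epoch": epoch, "train_loss": train_loss}
        if val_loss is not None:
            metrics["val_loss"] = val_loss
        if mean_dice is not None:
            metrics["mean_dice"] = mean_dice  # mean Dice score across labels
        wandb.log(metrics)  # appears in the Weights and Biases UI in real time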

If you opt to use Weights and Biases, it will also generate system analytics such as those shown below. These can help you determine how the model is affecting your GPUs. Some of the graphs below have been customized with an exponential moving average to visualize average system usage and utilization over time. Higher GPU utilization may allow your model to train faster; this can potentially be achieved by caching more images in GPU memory or by increasing the batch size.

step-04-image-04.png
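
The exponential moving average used to smooth those utilization curves is simple to reproduce if you want to post-process exported metrics yourself; the helper below is a small illustrative sketch, not part of the lab code.

    # Exponential moving average for smoothing noisy utilization curves (illustrative).
    def ema(values, alpha=0.1):
        smoothed, prev = [], None
        for x in values:
            prev = x if prev is None else alpha * x + (1 - alpha) * prev
            smoothed.append(prev)
        return smoothed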
