Working With the Containers
TAO encapsulates DNN training pipelines that may be developed across different training frameworks. To isolate dependencies and training environments, these DNN applications are housed in separate containers. The TAO Launcher abstracts away which network is associated with which container; however, it requires you to run TAO from an environment where the launcher can instantiate Docker containers. This requires elevated user privileges, or a Docker-in-Docker (DinD) setup to invoke Docker from within your container. This may not be ideal in several scenarios, such as:
Running on a remote cluster where SLURM instantiates a container on the provisioned cluster node
Running on a machine without elevated user privileges
Running multi-node training jobs
Running on an instance of a Multi-Instance GPU (MIG)
To run the DNNs from one of the enclosed containers, you first need to know which networks are housed in which container. A simple way to get this information is to install the TAO Launcher on your local machine and run tao info --verbose.
The following is sample output from TAO 5.0.0:
Configuration of the TAO Instance

task_group:
    model:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-tf2.9.1:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:
                    docker_registry: nvcr.io
                    tasks:
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.0.0-pyt:
                    docker_registry: nvcr.io
                    tasks:
                        1. action_recognition
                        2. centerpose
                        3. classification_pyt
                        4. deformable_detr
                        5. dino
                        6. mal
                        7. ml_recog
                        8. ocdnet
                        9. ocrnet
                        10. optical_inspection
                        11. pointpillars
                        12. pose_classification
                        13. re_identification
                        14. re_identification_transformer
                        15. segformer
                        16. visual_changenet
    dataset:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-dataservice:
                    docker_registry: nvcr.io
                    tasks:
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-deploy:
                    docker_registry: nvcr.io
                    tasks:
                        1. centerpose
                        2. classification_pyt
                        3. classification_tf1
                        4. classification_tf2
                        5. deformable_detr
                        6. detectnet_v2
                        7. dino
                        8. dssd
                        9. efficientdet_tf1
                        10. efficientdet_tf2
                        11. faster_rcnn
                        12. lprnet
                        13. mask_rcnn
                        14. ml_recog
                        15. multitask_classification
                        16. ocdnet
                        17. ocrnet
                        18. optical_inspection
                        19. retinanet
                        20. segformer
                        21. ssd
                        22. unet
                        23. visual_changenet
                        24. yolo_v3
                        25. yolo_v4
                        26. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.0.0
published_date: 05/31/2023
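As a sketch of how this output can be used programmatically, the snippet below scans a saved copy of the tao info --verbose output for the container tag that provides a given task. The file name, the awk one-liner, and the abbreviated excerpt are illustrative, not part of TAO; generate the real file with tao info --verbose > tao_info.txt.

```shell
# Illustrative sketch, not part of TAO: find which container tag provides a
# given task by scanning a saved copy of the `tao info --verbose` output.
# The excerpt below is abbreviated for brevity.
cat > tao_info.txt <<'EOF'
                5.0.0-tf2.9.1:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:
                    docker_registry: nvcr.io
                    tasks:
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
EOF

awk -v task="detectnet_v2" '
    /^[[:space:]]*5\.0\.0-[^ ]*:$/ { tag = $1; sub(/:$/, "", tag) }  # remember current tag
    $NF == task                    { print tag }                     # task listed under tag
' tao_info.txt
# -> 5.0.0-tf1.15.5
```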
The container name associated with a task can be derived as $DOCKER_REGISTRY/$DOCKER_NAME:$DOCKER_TAG. For example, from the log above, the container name to run detectnet_v2 can be derived as follows:
export DOCKER_REGISTRY="nvcr.io"
export DOCKER_NAME="nvidia/tao/tao-toolkit"
export DOCKER_TAG="5.0.0-tf1.15.5"
export DOCKER_CONTAINER=$DOCKER_REGISTRY/$DOCKER_NAME:$DOCKER_TAG
Once you have the Docker name, invoke the container by running the commands defined by the network without the tao prefix. For example, the following command runs a detectnet_v2 training job with 4 GPUs:
docker run -it --rm --gpus all \
-v /path/in/host:/path/in/docker \
$DOCKER_CONTAINER \
detectnet_v2 train -e /path/to/experiment/spec.txt \
-r /path/to/results/dir \
-k $KEY --gpus 4
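Note that paths passed to the command (and paths inside the spec file) must refer to the container side of the -v host:container mount, not the host side. The sketch below illustrates the mapping; the directory names and the helper function are hypothetical, not part of TAO.

```shell
# Illustrative sketch: files referenced by the training command must use the
# container-side path of the -v <host_dir>:<container_dir> mount.
HOST_DIR=/path/in/host
DOCKER_DIR=/path/in/docker

# Hypothetical helper: rewrite a host path to its in-container equivalent.
to_container_path() {
    printf '%s\n' "${1/#$HOST_DIR/$DOCKER_DIR}"
}

to_container_path "$HOST_DIR/specs/spec.txt"
# -> /path/in/docker/specs/spec.txt
```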
As of version 3.0-21.11, TAO supports multi-node training for the following CV models:
Image classification
Multi-task classification
Detectnet_v2
FasterRCNN
SSD
DSSD
YOLOv3
YOLOv4
YOLOv4-Tiny
RetinaNet
MaskRCNN
EfficientDet
UNet
For these networks, the only task that can run multi-node training is train. To invoke multi-node training, simply add the --multi-node argument to the train command.
For example, the multi-GPU training command given above can be issued as a multi-node command:
detectnet_v2 train -e /path/to/experiment/spec.txt \
-r /path/to/results/dir \
-k $KEY \
--gpus 4 \
--multi-node
TAO uses Open MPI + Horovod to orchestrate multi-GPU and multi-node training. By default, the following arguments are appended to the mpirun command:
-x NCCL_IB_HCA=mlx5_4,mlx5_6,mlx5_8,mlx5_10 -x NCCL_SOCKET_IFNAME=^lo,docker
To add more arguments to the mpirun command, pass them through the --mpirun-arg option of the train command.
When running multi-node training, the entire dataset must be visible to all nodes running the training. If the data is not present, training jobs may crash with errors stating that the data couldn’t be found.
For example, if you have a .tar dataset that has been downloaded to one of the nodes (rank 0) in a multi-node job with two nodes and eight GPUs each, a simple way to extract the data is to run the extraction as a multi-node process using mpirun:
mpirun -np 16 --allow-run-as-root bash -c 'if [[ $OMPI_COMM_WORLD_LOCAL_RANK -eq 0 ]]; then set -x && tar -xf dataset.tar -C /raid; fi '
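The command above gates the extraction on OMPI_COMM_WORLD_LOCAL_RANK so that only one process per node (local rank 0) touches the tarball. The gating behavior can be sketched without mpirun by setting the variable by hand; the loop below simulates four ranks on one node and is not TAO-specific.

```shell
# Simulate the local-rank-0 gating used in the mpirun command above.
# mpirun sets OMPI_COMM_WORLD_LOCAL_RANK per process; here we set it by hand.
for rank in 0 1 2 3; do
    OMPI_COMM_WORLD_LOCAL_RANK=$rank bash -c '
        if [[ $OMPI_COMM_WORLD_LOCAL_RANK -eq 0 ]]; then
            echo "local rank $OMPI_COMM_WORLD_LOCAL_RANK: extracting dataset"
        else
            echo "local rank $OMPI_COMM_WORLD_LOCAL_RANK: skipping"
        fi'
done
```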
NVIDIA Multi-Instance GPU (MIG) expands the performance and value of data-center-class GPUs, namely the NVIDIA H100, A100, and A30 Tensor Core GPUs, by allowing users to partition a single GPU into as many as seven instances, each with its own fully isolated high-bandwidth memory, cache, and compute cores. For more information on setting up MIG, refer to the NVIDIA Multi-Instance GPU User Guide.
Read the supported configurations in the MIG document to understand the best way to split and improve utilization.
The following sample command runs a DetectNet_v2 training session on a MIG-enabled GPU.
docker run -it --rm --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES="MIG-<DEVICE_UUID>,MIG-<DEVICE_UUID>" \
-v /path/in/host:/path/in/container \
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 \
detectnet_v2 train -e /path/to/experiment.txt \
-k <key> \
-r /path/to/store/results \
-n <name of trained model>
You must add the --runtime=nvidia flag to the docker command and export the NVIDIA_VISIBLE_DEVICES environment variable with the UUIDs of the GPU instances. You can get the UUID of a specific MIG instance by running the nvidia-smi -L command.
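On a MIG-enabled machine, the instance UUIDs can be pulled out of the nvidia-smi -L output with standard text tools. The snippet below runs against a hard-coded sample rather than a real GPU; the exact output format varies by driver version, and the UUIDs shown are placeholders.

```shell
# Illustrative only: parse MIG instance UUIDs from sample `nvidia-smi -L` output.
# The sample text and UUIDs below are placeholders, not real device listings.
sample='GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaaaaaa)
  MIG 3g.20gb     Device  0: (UUID: MIG-bbbbbbbb)
  MIG 3g.20gb     Device  1: (UUID: MIG-cccccccc)'

# Keep only the MIG UUIDs, comma-separated for NVIDIA_VISIBLE_DEVICES.
printf '%s\n' "$sample" | grep -o 'MIG-[^)]*' | paste -sd, -
# -> MIG-bbbbbbbb,MIG-cccccccc
```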
Running TAO via the TAO Launcher requires the user to have docker-ce installed, since the launcher interacts with the Docker service on the local host to run the commands. Installing Docker requires elevated user privileges to run as root. If you don't have elevated user privileges on your compute machine, you may run TAO using Singularity. This requires you to bypass the tao-launcher and interact directly with the component Docker containers. For information on which tasks are implemented in which Dockers, run the tao info --verbose command. Once you have derived the task-to-Docker mapping, you may run the tasks using the following steps:
Pull the required Docker using the following singularity command:

singularity pull tao-toolkit-tf:5.0.0-tf1.15.5.sif docker://nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

Note: For this command to work, the latest version of Singularity must be installed.
Instantiate the Docker using the following command:
singularity run --nv -B /path/to/workspace:/path/to/workspace tao-toolkit-tf:5.0.0-tf1.15.5.sif
Run the commands inside the container without the tao prefix. For example, to run a detectnet_v2 training in the tao-toolkit-tf container, use the following command:
detectnet_v2 train -e /path/to/workspace/specs/file.txt \
-k $KEY \
-r /path/to/workspace/results \
-n name_of_final_model \
--gpus $NUM_GPUS