NVIDIA TAO Toolkit v4.0
NVIDIA TAO Release tlt.40

Working With the Containers

TAO Toolkit encapsulates DNN training pipelines that may be developed across different training frameworks. To isolate dependencies and training environments, these DNN applications are housed in different containers. The TAO Toolkit Launcher abstracts the details of which network is associated with which container. However, the launcher must run in an environment where it can instantiate Docker containers. This requires elevated user privileges or, if you are already working inside a container, a Docker-in-Docker (DinD) setup to call Docker from within it. This may not be ideal in several scenarios, such as:

  • Running on a remote cluster where SLURM instantiates a container on the provisioned cluster node

  • Running on a machine without elevated user privileges

  • Running multi-node training jobs

To run the DNNs from one of the multiple enclosed containers, you first need to know which networks are housed in which container. A simple way to get this information is to install the TAO Toolkit Launcher on your local machine and run tao info --verbose, which lists the tasks enclosed in each container.

The following is sample output from TAO 4.0.0:


Configuration of the TAO Toolkit Instance

dockers:
    nvidia/tao/tao-toolkit:
        4.0.0-tf2.9.1:
            docker_registry: nvcr.io
            tasks:
                1. classification_tf2
                2. efficientdet_tf2
        4.0.0-tf1.15.5:
            docker_registry: nvcr.io
            tasks:
                1. augment
                2. bpnet
                3. classification_tf1
                4. dssd
                5. emotionnet
                6. efficientdet_tf1
                7. fpenet
                8. gazenet
                9. gesturenet
                10. heartratenet
                11. lprnet
                12. mask_rcnn
                13. multitask_classification
                14. retinanet
                15. ssd
                16. unet
                17. yolo_v3
                18. yolo_v4
                19. yolo_v4_tiny
                20. converter
                21. detectnet_v2
                22. faster_rcnn
        4.0.0-pyt:
            docker_registry: nvcr.io
            tasks:
                1. speech_to_text
                2. speech_to_text_citrinet
                3. text_classification
                4. question_answering
                5. token_classification
                6. intent_slot_classification
                7. punctuation_and_capitalization
                8. action_recognition
                9. spectro_gen
                10. vocoder
                11. deformable_detr
                12. segformer
                13. re_identification
                14. pose_classification
                15. n_gram
format_version: 2.0
toolkit_version: 4.0.0
published_date: 12/06/2022

The container name associated with the task can be derived as $DOCKER_REGISTRY/$DOCKER_NAME:$DOCKER_TAG. For example, from the log above, the Docker name to run detectnet_v2 can be derived as follows:


export DOCKER_REGISTRY="nvcr.io"
export DOCKER_NAME="nvidia/tao/tao-toolkit"
export DOCKER_TAG="4.0.0-tf1.15.5"
export DOCKER_CONTAINER=$DOCKER_REGISTRY/$DOCKER_NAME:$DOCKER_TAG
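The derivation above can be wrapped in a small helper. The following is a minimal sketch and not part of TAO Toolkit: the function names tao_tag_for_task and tao_image_for_task are hypothetical, and the task-to-tag table is transcribed by hand from the TAO 4.0.0 sample output and covers only a few tasks.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with TAO Toolkit).
# The mapping is copied from the TAO 4.0.0 `tao info --verbose` sample output
# and lists only a handful of tasks; extend it from your own output.

tao_tag_for_task() {
    case "$1" in
        classification_tf2|efficientdet_tf2)
            echo "4.0.0-tf2.9.1" ;;
        detectnet_v2|faster_rcnn|ssd|unet|yolo_v4)
            echo "4.0.0-tf1.15.5" ;;
        action_recognition|deformable_detr|segformer)
            echo "4.0.0-pyt" ;;
        *)
            echo "unknown task: $1" >&2
            return 1 ;;
    esac
}

tao_image_for_task() (
    # The subshell body keeps these variables local to the function.
    registry="nvcr.io"
    name="nvidia/tao/tao-toolkit"
    tag="$(tao_tag_for_task "$1")" || exit 1
    echo "${registry}/${name}:${tag}"
)

# Same value as the manual exports for detectnet_v2.
DOCKER_CONTAINER="$(tao_image_for_task detectnet_v2)"
echo "$DOCKER_CONTAINER"   # prints nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
```

A hand-maintained table like this goes stale when the toolkit version changes, so it is best regenerated from the tao info --verbose output of the release you are using.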

Once you have the Docker name, invoke the container by running the commands defined by the network without the tao prefix. For example, the following command runs a detectnet_v2 training job with 4 GPUs:


docker run -it --rm --gpus all \
    -v /path/in/host:/path/in/docker \
    $DOCKER_CONTAINER \
    detectnet_v2 train -e /path/to/experiment/spec.txt \
                       -r /path/to/results/dir \
                       -k $KEY --gpus 4
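Two details of this command are easy to miss: the paths given to -e and -r are resolved inside the container, and the two --gpus options belong to different programs. The sketch below is a dry run that only prints the command. The function name tao_docker_cmd and the /workspace path are hypothetical, and mounting the workspace at the same path on both sides is one possible convention, not a requirement.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the docker command instead of running it.
# Paths that contain spaces would need additional quoting.
tao_docker_cmd() (
    image="$1"
    workspace="$2"
    shift 2
    # Mounting the workspace at the same path inside the container keeps
    # the -e and -r paths identical on the host and in the container.
    # `--gpus all` here is the docker option that exposes the host GPUs;
    # the `--gpus N` in the task arguments sets how many GPUs training uses.
    echo "docker run -it --rm --gpus all -v ${workspace}:${workspace} ${image} $*"
)

# '$KEY' is single-quoted so the printed command still refers to the variable.
tao_docker_cmd "nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5" /workspace \
    detectnet_v2 train -e /workspace/specs/spec.txt -r /workspace/results \
    -k '$KEY' --gpus 4
```

Review the printed line, then paste it into a shell where KEY is set.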

Starting with version 3.0-21.11, TAO Toolkit supports multi-node training for the following CV models:

  • Image classification

  • Multi-task classification

  • DetectNet_v2

  • FasterRCNN

  • SSD

  • DSSD

  • YOLOv3

  • YOLOv4

  • YOLOv4-Tiny

  • RetinaNet

  • MaskRCNN

  • EfficientDet

  • UNet

For these networks, the only task that can run multi-node training is train. To invoke multi-node training, simply add the --multi-node argument to the train command.

For example, the multi-GPU training command given above can be issued as a multi-node command:


detectnet_v2 train -e /path/to/experiment/spec.txt \
                   -r /path/to/results/dir \
                   -k $KEY \
                   --gpus 4 \
                   --multi-node

TAO uses Open MPI and Horovod to orchestrate multi-GPU and multi-node training. By default, the following arguments are appended to the mpirun command:


-x NCCL_IB_HCA=mlx5_4,mlx5_6,mlx5_8,mlx5_10 -x NCCL_SOCKET_IFNAME=^lo,docker

To add more arguments to the mpirun command, pass them through the --mpirun-arg option of the train command.
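A minimal sketch of such a command is given here as a dry run that prints the command line for review. It assumes that --mpirun-arg accepts the extra mpirun options as a single quoted string, which this page does not spell out. The function name print_multinode_cmd is hypothetical, and NCCL_DEBUG=INFO is only an illustrative variable to export with mpirun's -x option.

```shell
#!/bin/sh
# Assumption: --mpirun-arg takes the extra mpirun options as one quoted string.
# print_multinode_cmd is a hypothetical helper that only prints the command.
print_multinode_cmd() {
    cat <<EOF
detectnet_v2 train -e /path/to/experiment/spec.txt \\
    -r /path/to/results/dir \\
    -k \$KEY \\
    --gpus 4 \\
    --multi-node \\
    --mpirun-arg "$1"
EOF
}

# Illustration: ask mpirun to export NCCL_DEBUG=INFO to every rank.
print_multinode_cmd "-x NCCL_DEBUG=INFO"
```

The printed command is meant to be run inside the TAO container on a system where the multi-node job has already been provisioned.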


When running multi-node training, the entire dataset must be visible to all nodes running the training. If the data is not present, training jobs may crash with errors stating that the data couldn’t be found.

For example, if you have a .tar dataset that has been downloaded to one of the nodes (rank 0) in a multi-node job with two nodes and eight GPUs each, a simple way to extract the data is to run the extraction as a multi-node process using mpirun:


mpirun -np 16 --allow-run-as-root bash -c \
    'if [[ $OMPI_COMM_WORLD_LOCAL_RANK -eq 0 ]]; then set -x && tar -xf dataset.tar -C /raid; fi'
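The one-liner works because Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for every process it launches, and local rank 0 exists exactly once per node. With 16 processes on two nodes, the archive is therefore extracted twice, once on each node. The same logic can be written as a reusable function. This is a sketch: the function name extract_if_local_leader is hypothetical, and it assumes the archive path is readable from every node.

```shell
#!/bin/sh
# Hypothetical helper; the name is not part of TAO Toolkit or Open MPI.
extract_if_local_leader() (
    archive="$1"
    dest="$2"
    # Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for each launched process.
    # Defaulting to 0 lets the function also run outside mpirun.
    if [ "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" -eq 0 ]; then
        mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
    fi
)

# Example call, guarded so that nothing happens when the archive is absent.
if [ -f dataset.tar ]; then
    extract_if_local_leader dataset.tar /raid
fi
```

Saved as a script (for example under the hypothetical name extract_dataset.sh), it could be launched with mpirun -np 16 --allow-run-as-root sh extract_dataset.sh, so that every rank runs it and only the local leaders extract.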

Running TAO Toolkit via the TAO Toolkit Launcher requires docker-ce to be installed, since the launcher interacts with the Docker service on the local host to run the commands. Installing Docker requires elevated user privileges (root access). If you don’t have elevated user privileges on your compute machine, you may run TAO Toolkit using Singularity. This requires you to bypass the tao-launcher and interact directly with the component Docker containers. For information on which tasks are implemented in which containers, run the tao info --verbose command. Once you have derived the task-to-container mapping, you may run the tasks using the following steps:

  1. Pull the required Docker image using the following singularity command:


    singularity pull tao-toolkit-tf:4.0.0-tf1.15.5.sif docker://nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5

    Note: For this command to work, the latest version of singularity must be installed.

  2. Instantiate the container using the following command:


    singularity run --nv -B /path/to/workspace:/path/to/workspace tao-toolkit-tf:4.0.0-tf1.15.5.sif

  3. Run the commands inside the container without the tao prefix. For example, to run a detectnet_v2 training in the tao-toolkit-tf container, use the following command:


    detectnet_v2 train -e /path/to/workspace/specs/file.txt \
                       -k $KEY \
                       -r /path/to/workspace/results \
                       -n name_of_final_model \
                       --gpus $NUM_GPUS
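For batch jobs, where an interactive session inside the container is not practical, the steps above can be collapsed into a pull followed by a single non-interactive command. The sketch below only prints the two commands. It uses singularity exec rather than singularity run, on the assumption (not confirmed on this page) that the task entry points such as detectnet_v2 are on the PATH inside the image. The function name tao_singularity_cmds is hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the Singularity commands for one task.
# Assumption: the task command is on PATH inside the image, so it can be
# started with `singularity exec` instead of an interactive `singularity run`.
tao_singularity_cmds() (
    tag="$1"
    workspace="$2"
    shift 2
    sif="tao-toolkit-tf:${tag}.sif"
    # Step 1: convert the Docker image into a local .sif file.
    echo "singularity pull ${sif} docker://nvcr.io/nvidia/tao/tao-toolkit:${tag}"
    # Steps 2 and 3: bind the workspace, enable GPUs (--nv), run the task.
    echo "singularity exec --nv -B ${workspace}:${workspace} ${sif} $*"
)

tao_singularity_cmds 4.0.0-tf1.15.5 /path/to/workspace \
    detectnet_v2 train -e /path/to/workspace/specs/file.txt \
    -k '$KEY' -r /path/to/workspace/results --gpus '$NUM_GPUS'
```

The .sif file name follows the naming used in the steps above; for an image with a different tag, such as 4.0.0-pyt, a different local file name would be more descriptive.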

© Copyright 2022, NVIDIA. Last updated on Mar 23, 2023.