TAO Launcher#

TAO encapsulates DNN training pipelines that may be developed across different training platforms. To abstract these details from users, TAO is packaged with a launcher CLI. The CLI is a python3 wheel package that can be installed using pip. Once installed, the launcher CLI saves you from having to instantiate and run the several TAO containers yourself and map the commands to them manually.

In this release of TAO, the TAO package includes multiple underlying Docker containers, one for each training framework. Each Docker container includes entrypoints to tasks, which in turn run their associated sub-tasks. The tasks in the containers are grouped into task_groups, which fall into the following categories:

  • model

  • dataset

  • deploy

The tasks under model contain routines to train, evaluate, and run inference on any of the DNN models supported by TAO. The tasks under dataset contain routines to manipulate datasets, such as augment and auto_label, while the tasks under deploy optimize and deploy models to TensorRT.

For example, DetectNet_v2 is a computer vision task for object detection in TAO that supports subtasks such as train, prune, evaluate, and export. When you execute a command, for example tao model detectnet_v2 train --help, the TAO Launcher does the following:

  1. Pulls the required TAO container with the entrypoint for DetectNet_v2

  2. Creates an instance of the container

  3. Runs the detectnet_v2 entrypoint with the train sub-task
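Conceptually, the three steps above amount to pulling an image and then issuing a docker run with the task entrypoint. The sketch below is a loose approximation of that flow; the image tag, mount path, and flags are illustrative assumptions, not the launcher's actual internals:

```python
# Illustrative sketch only: the image tag, GPU flags, and mount path are
# assumptions, not the launcher's real implementation.

def build_docker_command(task, subtask, *cli_args):
    """Assemble, roughly, the docker invocation the launcher would issue."""
    image = "nvcr.io/nvidia/tao/tao-toolkit:<docker_tag>"  # tag chosen per task
    mounts = ["-v", "/path/on/host:/workspace/tao-experiments"]  # from ~/.tao_mounts.json
    return ["docker", "run", "--rm", "--gpus", "all", *mounts,
            image, task, subtask, *cli_args]

print(" ".join(build_docker_command("detectnet_v2", "train", "--help")))
```

The actual launcher additionally resolves the right container per task and applies the mounts and options from the launcher config file, described below.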

You may visualize the user interaction diagram for the TAO Launcher when running training and evaluation for a DetectNet_v2 model as follows:

../_images/tao_tf_user_interaction.jpg

Similarly, the interaction diagram for training a DINO model is as follows:

../_images/tao_pt_user_interaction.jpg

The following sections cover supported commands and configuring the launcher.

Running the launcher#

The sample jupyter notebooks in the tao_launcher_starter_kit directory of the TAO Getting Started resource on NGC cover the steps to install the launcher CLI.

Once the launcher has been installed, the workflow to run the launcher is as follows.

  1. Listing the task_groups supported in the toolkit.

    After installing the launcher, you will be able to list the task_groups that are supported by the TAO Launcher. The output of the tao --help command is as follows:

    usage: tao [-h] {list,stop,info,dataset,deploy,model} ...
    
    Launcher for TAO Toolkit.
    
    optional arguments:
      -h, --help            show this help message and exit
    
    task_groups:
      {list,stop,info,dataset,deploy,model}
    
  2. Listing the tasks supported under a task_group.

    Once you have listed the task_groups included in the toolkit, you can list the tasks associated with a group by cascading the --help option. For example, to list the tasks under model, run the command tao model --help:

    usage: tao model [-h]
                    {list,stop,info,dataset,deploy,model} ...
                    {action_recognition,classification_pyt,deformable_detr,dino,mal,ml_recog,ocdnet,ocrnet,optical_inspection,pointpillars,pose_classification,re_identification,re_identification_transformer,segformer} ...
    
    optional arguments:
      -h, --help            show this help message and exit
    
    task_groups:
      {list,stop,info,dataset,deploy,model}
    
    task:
      {action_recognition,classification_pyt,deformable_detr,dino,mal,ml_recog,ocdnet,ocrnet,optical_inspection,pointpillars,pose_classification,re_identification,re_identification_transformer,segformer}
    
  3. Configuring the launcher instance.

    Running deep neural network training implies working with large datasets. These datasets usually reside on network share drives with significantly higher storage capacity. Since the TAO Launcher uses docker containers under the hood, these drives/mount points need to be mapped into the docker. The launcher instance can be configured in the ~/.tao_mounts.json file.

    The launcher config file consists of three sections:

    • Mounts

    • Envs

    • DockerOptions

    The Mounts parameter defines the paths on the local machine that should be mapped into the docker. This is a list of json dictionaries, each containing the source path on the local machine and the destination path it is mapped to for the TAO commands.

    The Envs parameter defines the environment variables to be set in the respective TAO docker. This is also a list of dictionaries. Each dictionary entry defines two key-value pairs:

    • variable: The name of the environment variable you would like to set

    • value: The value of the environment variable

    The DockerOptions parameter defines the options to set when invoking the docker instance. This parameter is a dictionary of key-value pairs for the parameters and options to set. Currently, the TAO Launcher only allows you to configure the following parameters:

    • shm_size: Defines the shared memory size of the docker. If this parameter isn’t set, the TAO instance allocates 64MB by default. We recommend setting this to "16G", thereby allocating 16GB of shared memory.

    • ulimits: Defines the user limits in the docker. This parameter corresponds to the ulimit parameters in /etc/security/limits.conf. We recommend users set memlock to -1 and stack to 67108864.

    • user: Defines the user id and group id of the user running the commands in the docker. By default, if this parameter isn’t defined in ~/.tao_mounts.json, the launcher uses the uid and gid of the root user. However, this means that directories created by the TAO dockers would have root permissions. If you would like the user in the docker to be the same as the host user, set this parameter to “UID:GID”, where UID and GID can be obtained from the command line by running id -u and id -g.

    • ports: This parameter defines the ports in the docker to be published to the host.

      You may specify this parameter as a dictionary mapping ports in the docker to ports on the host machine. For example, if you wish to expose port 8888 and port 8000, this parameter would look as follows:

      "ports":{
          "8888":"8888",
          "8000":"8000"
       }
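As a convenience, the UID and GID for the user parameter above can be queried and formatted in one step. This small helper is only a sketch, not part of the launcher:

```shell
# Print the "user" value for ~/.tao_mounts.json based on the current host user.
uid=$(id -u)
gid=$(id -g)
echo "\"user\": \"${uid}:${gid}\""
```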
      

    Use the following code block as a sample ~/.tao_mounts.json file. In this sample, we define three mount points and an environment variable called CUDA_DEVICE_ORDER. For DockerOptions, we set a shared memory size of 16G, the user limits, and the host user’s permissions. We also bind port 8888 in the docker to port 8888 on the host.

    {
        "Mounts": [
            {
                "source": "/path/to/your/data",
                "destination": "/workspace/tao-experiments/data"
            },
            {
                "source": "/path/to/your/local/results",
                "destination": "/workspace/tao-experiments/results"
            },
            {
                "source": "/path/to/config/files",
                "destination": "/workspace/tao-experiments/specs"
            }
        ],
        "Envs": [
            {
                "variable": "CUDA_DEVICE_ORDER",
                "value": "PCI_BUS_ID"
            }
        ],
        "DockerOptions": {
            "shm_size": "16G",
            "ulimits": {
                "memlock": -1,
                "stack": 67108864
            },
            "user": "1000:1000",
            "ports": {
                "8888": 8888
            }
        }
    }
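Because a mistake in this file typically only surfaces when a container starts, it can help to sanity-check the config first. The helper below is illustrative and not part of the TAO CLI; it assumes only the documented Mounts structure:

```python
import json
import os

def check_tao_mounts(path="~/.tao_mounts.json"):
    """Lightly validate a launcher config: parseable JSON, well-formed
    Mounts entries, and existing source paths. Illustrative helper only."""
    with open(os.path.expanduser(path)) as f:
        config = json.load(f)
    problems = []
    for mount in config.get("Mounts", []):
        if not {"source", "destination"} <= mount.keys():
            problems.append(f"mount missing source/destination: {mount}")
        elif not os.path.exists(mount["source"]):
            problems.append(f"source does not exist: {mount['source']}")
    return problems
```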
    

    Similarly, a sample config file containing two mount points and no docker options is shown below.

    {
        "Mounts": [
            {
                "source": "/path/to/your/experiments",
                "destination": "/workspace/tao-experiments"
            },
            {
                "source": "/path/to/config/files",
                "destination": "/workspace/tao-experiments/specs"
            }
        ]
    }
    
  4. Running a task.

    Once you have installed the TAO Launcher, you can run the tasks supported by the TAO Toolkit using the following command format:

    tao model <task_group> <task> <subtask> <cli_args>
    

    To view the sub-tasks supported by a certain task, append the --help option to the task command.

    For example, listing the sub-tasks of detectnet_v2 produces the following output:

    $ tao model detectnet_v2 --help
    
    usage: detectnet_v2 [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp] [--log_file LOG_FILE]
                        {train,prune,inference,export,evaluate,dataset_convert,calibration_tensorfile} ...
    
    TAO Toolkit
    
    optional arguments:
      -h, --help            show this help message and exit
      --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                            The number of horovod child processes to be spawned. Default is -1(equal to --gpus).
      --gpus GPUS           The number of GPUs to be used for the job.
      --gpu_index GPU_INDEX [GPU_INDEX ...]
                            The indices of the GPU's to be used.
      --use_amp             Flag to enable Auto Mixed Precision.
      --log_file LOG_FILE   Path to the output log file.
    
    tasks:
      {train,prune,inference,export,evaluate,dataset_convert,calibration_tensorfile}
    2023-06-01 08:52:50,522 [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 326: Stopping container.
    

    The TAO Launcher also supports a run command associated with every task to allow you to run custom scripts in the docker. This gives you the opportunity to bring your own data pre-processing scripts and leverage the prebuilt dependencies and isolated dev environments in the TAO dockers.

    For example, assume you have a shell script to download and preprocess the COCO dataset into TFRecords for MaskRCNN, which requires TensorFlow as a dependency. You can simply map the directory containing that script into the docker via the ~/.tao_mounts.json file described in step 3 above, and run it as

    tao model mask_rcnn run /path/to/download_and_preprocess_coco.sh <script_args>
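A hypothetical skeleton for such a script might look like the following; every path and step here is illustrative, and the real download and conversion logic is left to you:

```shell
#!/bin/bash
# Hypothetical skeleton of download_and_preprocess_coco.sh; every path and
# step is illustrative. Inside the container you would pass a mounted path
# such as /workspace/tao-experiments/data as the first argument.
set -euo pipefail

DATA_DIR="${1:-${TMPDIR:-/tmp}/tao-experiments/data}"
mkdir -p "${DATA_DIR}/tfrecords"

# 1. Download the raw COCO archives here (URLs omitted).
# 2. Convert the annotations to TFRecords with your own conversion tool.
echo "TFRecords will be written to ${DATA_DIR}/tfrecords"
```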
    
  5. Running an interactive session.

    The TAO Launcher CLI allows you to interactively run commands inside the docker associated with the task you wish to run. This is a useful tool for debugging commands inside the docker and viewing the filesystem from inside the container, as opposed to viewing the end output on the host system. To invoke an interactive session, run the tao command with the task and no other arguments. For example, to run interactive commands inside the docker containing the detectnet_v2 task, run the following command:

    tao model detectnet_v2
    

    This command opens up an interactive session inside the tao-toolkit-tf docker.

    Note

    The interactive command uses the ~/.tao_mounts.json file to configure the launcher and mount paths in the host file system to the docker.

    Once you are inside the interactive session, you can run a task and its associated subtask by calling <task> <subtask> <cli_args> without the tao prefix.

    For example, to train a detectnet_v2 model in the interactive session, invoke the session using tao model detectnet_v2 and then run the following command:

    detectnet_v2 train -e /path/to/experiment_spec.txt
                       -k <key>
                       -r /path/to/train/output
                       --gpus <number of GPUs>
    

Handling launched processes#

The TAO Launcher allows you to list all the processes that were launched by an instance of the TAO Launcher on the host machine and kill any jobs you deem unnecessary, using the list and stop commands.

  1. Listing TAO launched processes

    The list command, as the name suggests, prints out a tabulated list of running processes along with the command that was used to invoke each process.

    A sample output of the tao model list command with one process running is shown below.

    ==============  ==================  =============================================================================================================================================================================================
    container_id    container_status    command
    ==============  ==================  =============================================================================================================================================================================================
    5316a70139      running             detectnet_v2 train -e /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/experiment_spec.txt -k tlt_encode -r /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned
    ==============  ==================  =============================================================================================================================================================================================
    
  2. Killing running TAO instances

    The tao stop command kills running containers, should you wish to abort the respective session. The usage for the tao stop command is shown below:

    usage: tao stop [-h] [--container_id CONTAINER_ID [CONTAINER_ID ...]] [--all]
                {info,list,stop,augment,classification,classifynet,detectnet_v2,dssd,emotionnet,faster_rcnn,fpenet,gazenet,heartratenet,intent_slot_classification,lprnet,mask_rcnn,punctuation_and_capitalization,question_answering,retinanet,speech_to_text,ssd,text_classification,converter,token_classification,yolo_v3,yolo_v4,yolo_v4_tiny}
                ...
    
    optional arguments:
    -h, --help            show this help message and exit
    --container_id CONTAINER_ID [CONTAINER_ID ...]
                            Ids of the containers to be stopped.
    --all                 Kill all running TAO containers.
    
    tasks:
    {info,list,stop,augment,classification,classifynet,detectnet_v2,dssd,emotionnet,faster_rcnn,fpenet,gazenet,heartratenet,intent_slot_classification,lprnet,mask_rcnn,punctuation_and_capitalization,question_answering,retinanet,speech_to_text,ssd,text_classification,converter,token_classification,yolo_v3,yolo_v4,yolo_v4_tiny}
    

    With tao stop, you may choose to either

    1. Kill a subset of the containers shown by the tao model list command by providing multiple container ids to the launcher’s --container_id arg

    A sample output of the tao model list command after running tao stop --container_id 5316a70139 is shown below.

    ==============  ==================  =========
    container_id    container_status    command
    ==============  ==================  =========
    ==============  ==================  =========
    
    2. Force kill all the containers by using the tao stop --all command.

    A sample output of the tao model list command after running the tao stop --all command is shown below.

    ==============  ==================  =========
    container_id    container_status    command
    ==============  ==================  =========
    ==============  ==================  =========
    
  3. Retrieving information for the underlying TAO components

    The tao info command allows users to retrieve information about the underlying components in the launcher. To retrieve options for the tao info command, you can use the tao info --help command. The sample usage for the command is as follows:

    usage: tao info [-h] [--verbose]
                {info,list,stop,info,augment,classification,detectnet_v2, ... ,converter,token_classification,unet,yolo_v3,yolo_v4,yolo_v4_tiny}
                ...
    
    optional arguments:
    -h, --help            show this help message and exit
    --verbose             Print information about the TAO instance.
    
    tasks:
    {info,list,stop,info,augment,classification,detectnet_v2,dssd,emotionnet,faster_rcnn,fpenet,gazenet,gesturenet,heartratenet,intent_slot_classification,lprnet,mask_rcnn,punctuation_and_capitalization,question_answering,retinanet,speech_to_text,ssd,text_classification,converter,token_classification,unet,yolo_v3,yolo_v4,yolo_v4_tiny}
    

    When you run tao info, the launcher returns concise information about the instance, namely the supported task groups, the format version of the launcher config, the toolkit version, and the publishing date.

    Configuration of the TAO Instance
    task_group: ['model', 'dataset', 'deploy']
    format_version: 3.0
    toolkit_version: 5.0.0
    published_date: 05/31/2023
    

    For more information about the dockers and the tasks supported by each docker, use the --verbose option of the tao info command. A sample output of the tao info --verbose command is shown below.

    Configuration of the TAO Instance
    
    task_group:
        model:
            dockers:
                nvidia/tao/tao-toolkit:
                    5.0.0-tf2.9.1:
                        docker_registry: nvcr.io
                        tasks:
                            1. classification_tf2
                            2. efficientdet_tf2
                    5.0.0-tf1.15.5:
                        docker_registry: nvcr.io
                        tasks:
                            1. bpnet
                            2. classification_tf1
                            3. converter
                            4. detectnet_v2
                            5. dssd
                            6. efficientdet_tf1
                            7. faster_rcnn
                            8. fpenet
                            9. lprnet
                            10. mask_rcnn
                            11. multitask_classification
                            12. retinanet
                            13. ssd
                            14. unet
                            15. yolo_v3
                            16. yolo_v4
                            17. yolo_v4_tiny
                    5.0.0-pyt:
                        docker_registry: nvcr.io
                        tasks:
                            1. action_recognition
                            2. centerpose
                            3. classification_pyt
                            4. deformable_detr
                            5. dino
                            6. mal
                            7. ml_recog
                            8. ocdnet
                            9. ocrnet
                            10. optical_inspection
                            11. pointpillars
                            12. pose_classification
                            13. re_identification
                            14. segformer
                            15. visual_changenet
        dataset:
            dockers:
                nvidia/tao/tao-toolkit:
                    5.0.0-dataservice:
                        docker_registry: nvcr.io
                        tasks:
                            1. augmentation
                            2. auto_label
                            3. annotations
                            4. analytics
        deploy:
            dockers:
                nvidia/tao/tao-toolkit:
                    5.0.0-deploy:
                        docker_registry: nvcr.io
                        tasks:
                            1. centerpose
                            2. classification_pyt
                            3. classification_tf1
                            4. classification_tf2
                            5. deformable_detr
                            6. detectnet_v2
                            7. dino
                            8. dssd
                            9. efficientdet_tf1
                            10. efficientdet_tf2
                            11. faster_rcnn
                            12. lprnet
                            13. mask_rcnn
                            14. ml_recog
                            15. multitask_classification
                            16. ocdnet
                            17. ocrnet
                            18. optical_inspection
                            19. retinanet
                            20. segformer
                            21. ssd
                            22. unet
                            23. yolo_v3
                            24. yolo_v4
                            25. yolo_v4_tiny
                            26. visual_changenet
    format_version: 3.0
    toolkit_version: 5.0.0
    published_date: 05/31/2023
    

Useful Environment variables#

The TAO Launcher watches the following environment variables to override certain configurable parameters.

  1. LAUNCHER_MOUNTS: This environment variable defines the path to the default launcher configuration .json file. If not set, the launcher configuration path is picked up from ~/.tao_mounts.json.

  2. OVERRIDE_REGISTRY: This environment variable defines the registry to pull the TAO dockers from. By default, the TAO dockers are hosted on NGC under the repository nvcr.io. For example, if you set the OVERRIDE_REGISTRY environment variable as shown below,

    export OVERRIDE_REGISTRY="dockerhub.io"
    

    the docker images would be

dockerhub.io/nvidia/tao/tao-toolkit:<docker_tag>

Note

When using the OVERRIDE_REGISTRY variable, use the docker login command to log in to this registry:

docker login $OVERRIDE_REGISTRY
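The resolution order described above can be approximated as follows. This is a sketch of the documented behavior under assumed defaults, not the launcher's actual implementation, and <docker_tag> stays a placeholder:

```python
import os

def resolve_launcher_settings():
    """Approximate the documented override behavior; not the launcher's code."""
    config_path = os.environ.get(
        "LAUNCHER_MOUNTS", os.path.expanduser("~/.tao_mounts.json"))
    registry = os.environ.get("OVERRIDE_REGISTRY", "nvcr.io")
    image = f"{registry}/nvidia/tao/tao-toolkit:<docker_tag>"
    return config_path, image
```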