TAO Toolkit ClearML Integration

The following networks in TAO Toolkit interface with ClearML, allowing you to continuously iterate, visualize and track multiple training experiments, and compile meaningful insights into a training use case.

  1. DetectNet-v2

  2. FasterRCNN

  3. Image Classification - TF2

  4. RetinaNet

  5. YOLOv4/YOLOv4-Tiny

  6. YOLOv3

  7. SSD

  8. DSSD

  9. EfficientDet - TF2

  10. MaskRCNN

  11. UNet

In TAO Toolkit 4.0.1, the ClearML visualization suite synchronizes with the data rendered in TensorBoard. Therefore, to see rendered data on the ClearML server, you need to enable TensorBoard visualization. The integration can also send you alerts via Slack or email for training runs that have failed.

Note

Enabling MLOps integration does not require you to install TensorBoard.

These are the broad steps involved with setting up ClearML for TAO Toolkit:

  1. Setting up a ClearML account

  2. Acquiring ClearML credentials

  3. Logging into the ClearML client

  4. Setting the configurable data for the ClearML experiment

Setting up a ClearML Account

Sign up for a free account at the ClearML website and then log in to your ClearML account.

Acquiring ClearML API Credentials

Once you have logged in to your ClearML account, generate new credentials by navigating to the settings pane in the top-right corner of the window and clicking Generate New Credentials.

Note

NVIDIA recommends getting the credentials in the form of environment variables for maximum ease of use. You can get these variables by clicking the Jupyter Notebook tab and copying the environment variables.

clearml_credentials.png

Jupyter notebook tab from the credentials under Settings/Workspace

Install clearml Library

Install the clearml library on your local machine in a Python 3 environment.

python3 -m pip install clearml
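
As a quick sanity check after installation, the following sketch tests whether the package can be located in the current Python environment. The helper name `is_installed` is hypothetical and introduced here for illustration only.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "clearml" is the package installed by the pip command above.
    print("clearml installed:", is_installed("clearml"))
```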

Log in to the ClearML Client in the TAO Toolkit Container

To send data from the local compute unit and render it on the ClearML server dashboard, the ClearML client in the TAO Toolkit container must be logged in and synchronized with your profile. To log in the ClearML client in the container, set the following environment variables with the values you received when generating your ClearML credentials. The listing uses the Jupyter notebook %env syntax.

%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
%env CLEARML_API_ACCESS_KEY=<API_ACCESS_KEY>
%env CLEARML_API_SECRET_KEY=<API_SECRET_KEY>
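
Before launching a training run, it can help to confirm that all five variables are actually set. The sketch below is one possible check, not part of TAO Toolkit or the clearml library; the function name `missing_clearml_vars` is hypothetical.

```python
import os

# Variable names taken from the listing above.
REQUIRED_VARS = (
    "CLEARML_WEB_HOST",
    "CLEARML_API_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_clearml_vars(env=None):
    """Return the required variable names that are unset or empty in `env`.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_clearml_vars()
    if missing:
        print("Missing ClearML variables:", ", ".join(missing))
    else:
        print("All ClearML variables are set.")
```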

To set the environment variables via the TAO Toolkit launcher, replace the value fields under the Envs element of the ~/.tao_mounts.json file, using the sample JSON file below for reference.

{
    "Mounts": [
        {
            "source": "/path/to/your/data",
            "destination": "/workspace/tao-experiments/data"
        },
        {
            "source": "/path/to/your/local/results",
            "destination": "/workspace/tao-experiments/results"
        },
        {
            "source": "/path/to/config/files",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CLEARML_WEB_HOST",
            "value": "https://app.clear.ml"
        },
        {
            "variable": "CLEARML_API_HOST",
            "value": "https://api.clear.ml"
        },
        {
            "variable": "CLEARML_FILES_HOST",
            "value": "https://files.clear.ml"
        },
        {
            "variable": "CLEARML_API_ACCESS_KEY",
            "value": "<API_ACCESS_KEY>"
        },
        {
            "variable": "CLEARML_API_SECRET_KEY",
            "value": "<API_SECRET_KEY>"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
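
If you prefer to edit the Envs element programmatically, the following sketch shows one way to do it. It assumes the file follows the structure of the sample above; the helper `set_envs` is hypothetical. The script prints the result instead of writing it back, so your existing file is not modified.

```python
import json
from pathlib import Path


def set_envs(config: dict, values: dict) -> dict:
    """Return a copy of `config` whose "Envs" list carries `values`.

    Entries with a matching "variable" name are overwritten; other
    entries and all other top-level keys are left unchanged.
    """
    updated = dict(config)
    envs = [dict(entry) for entry in config.get("Envs", [])]
    by_name = {entry["variable"]: entry for entry in envs}
    for name, value in values.items():
        if name in by_name:
            by_name[name]["value"] = value
        else:
            envs.append({"variable": name, "value": value})
    updated["Envs"] = envs
    return updated


if __name__ == "__main__":
    path = Path.home() / ".tao_mounts.json"
    config = json.loads(path.read_text()) if path.exists() else {}
    # Placeholders as in the sample file; substitute your own credentials.
    config = set_envs(config, {
        "CLEARML_API_ACCESS_KEY": "<API_ACCESS_KEY>",
        "CLEARML_API_SECRET_KEY": "<API_SECRET_KEY>",
    })
    print(json.dumps(config, indent=4))
```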

Note

When running the networks from TAO Toolkit containers directly, use the -e flag with the docker command. For example, to run detectnet_v2 with ClearML directly via the container, use the following code.

docker run -it --rm --gpus all \
    -v /path/in/host:/path/in/docker \
    -e CLEARML_WEB_HOST="https://app.clear.ml" \
    -e CLEARML_API_HOST="https://api.clear.ml" \
    -e CLEARML_FILES_HOST="https://files.clear.ml" \
    -e CLEARML_API_ACCESS_KEY="<API_ACCESS_KEY>" \
    -e CLEARML_API_SECRET_KEY="<API_SECRET_KEY>" \
    nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 \
    detectnet_v2 train -e /path/to/experiment/spec.txt \
    -r /path/to/results/dir \
    -k $KEY --gpus 4
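
When the docker command is assembled by a script, the -e flags can be generated from the variables already present in your shell. The sketch below illustrates the idea; the helper `docker_env_flags` is hypothetical and only builds the argument list, it does not run docker.

```python
import os

# Variable names taken from the docker command above.
CLEARML_VARS = (
    "CLEARML_WEB_HOST",
    "CLEARML_API_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def docker_env_flags(env=None):
    """Build ["-e", "NAME=value", ...] for each ClearML variable found in `env`.

    `env` defaults to the current process environment; variables that
    are not present are skipped.
    """
    env = os.environ if env is None else env
    flags = []
    for name in CLEARML_VARS:
        if name in env:
            flags += ["-e", f"{name}={env[name]}"]
    return flags


if __name__ == "__main__":
    print(" ".join(docker_env_flags()))
```

The resulting list can be spliced into a docker run argument list, for example one passed to `subprocess.run`.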

TAO Toolkit provides a few options to configure the ClearML client:

  1. project: A string containing the name of the project the experiment data gets uploaded to

  2. tags: A list of strings that can be used to tag the experiment

  3. task: The name of the experiment. To keep the name unique per run, TAO Toolkit appends a timestamp to the name string indicating when the experiment run was created.
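
To illustrate how a timestamp suffix keeps task names unique across runs, here is a minimal sketch. The separator and the timestamp format are assumptions chosen for illustration; the exact format TAO Toolkit uses is not stated here, and `unique_task_name` is a hypothetical helper.

```python
from datetime import datetime


def unique_task_name(base, now=None):
    """Append a timestamp to `base`.

    The "_YYYYMMDD_HHMMSS" suffix is an illustrative assumption,
    not the format TAO Toolkit necessarily produces.
    """
    now = datetime.now() if now is None else now
    return f"{base}_{now.strftime('%Y%m%d_%H%M%S')}"


if __name__ == "__main__":
    print(unique_task_name("training_experiment_name"))
```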

Depending on the schema the network follows, the spec file snippet to be added to the network may vary slightly.

For DetectNet_v2, UNet, FasterRCNN, YOLOv3/YOLOv4/YOLOv4-Tiny, RetinaNet, and SSD/DSSD, add the following snippet under the training_config element of the network's spec file.

visualizer{
    enabled: true
    clearml_config{
        project: "name_of_project"
        tags: "training"
        tags: "tao_toolkit"
        task: "training_experiment_name"
    }
}
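
Note that the list of tags is expressed by repeating the tags field once per tag. The sketch below shows how such a snippet could be generated from Python values; `render_visualizer` is a hypothetical helper, and it does not escape quotes inside the strings it is given.

```python
def render_visualizer(project, task, tags):
    """Render the visualizer block as text, one `tags:` line per tag.

    Caveat: values are inserted verbatim, so they must not contain
    double quotes.
    """
    lines = ["visualizer {", "  enabled: true", "  clearml_config {"]
    lines.append(f'    project: "{project}"')
    lines.extend(f'    tags: "{tag}"' for tag in tags)
    lines.append(f'    task: "{task}"')
    lines += ["  }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_visualizer(
        "name_of_project",
        "training_experiment_name",
        ["training", "tao_toolkit"],
    ))
```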

For MaskRCNN, add the following snippet in the network's training configuration.

clearml_config{
    project: "name_of_project"
    tags: "training"
    tags: "tao_toolkit"
    task: "training_experiment_name"
}

For EfficientDet-TF2 and Classification-TF2, add the following snippet under the train config element in the train.yaml file.

clearml:
  task: "name_of_the_experiment"
  project: "name_of_the_project"

The following are sample images from a successful visualization run for DetectNet_v2.

rich_media_images.png

Image showing intermediate inference images with bounding boxes before clustering using DBSCAN or NMS.

system_utilization.png

Image showing system utilization plots.

metric_plots.png

Metrics plotted during training

logging.png

Streaming logs from the local machine running the training.

histograms.png

Weight histograms of the trained model.

© Copyright 2024, NVIDIA. Last updated on Mar 22, 2024.