TAO ClearML Integration#

TAO’s TensorFlow 2 networks integrate with ClearML, enabling you to continuously iterate, visualize and track multiple training experiments, and compile meaningful insights into a training use case.

The ClearML visualization suite synchronizes with the data rendered in TensorBoard. Therefore, to see rendered data on the ClearML server, you need to enable TensorBoard visualization. The integration can also send you alerts via Slack or email for training runs that have failed.

Note

Enabling MLOps integration does not require you to install TensorBoard.

Quick Start#

These are the broad steps involved with setting up ClearML for TAO:

  1. Set up a ClearML account

  2. Acquire ClearML credentials

  3. Log in to the ClearML client

  4. Configure the experiment spec for ClearML

Set up a ClearML Account#

Sign up for a free account at the ClearML website and then log in to your ClearML account.

Acquire ClearML API Credentials#

  1. Log in to your ClearML account

  2. Navigate to the Configuration Page

  3. Click on Create new credentials to generate API keys

Note

NVIDIA recommends obtaining the credentials as environment variables for ease of use. To get them, click the Jupyter Notebook tab and copy the environment variables shown there.

../../_images/clearml_credentials.png

Jupyter Notebook tab for the credentials, under Settings/Workspace#

Log in to the ClearML Client#

To send data from your local compute unit and display it on the ClearML server dashboard, log in to the ClearML client within the TAO Finetuning Microservice and synchronize it with your profile. Add the following parameters to the docker_env_vars dictionary in your create_experiment request body.

For example:

{
    "docker_env_vars": {
        "CLEARML_WEB_HOST": "<web_host>",
        "CLEARML_API_HOST": "<api_host>",
        "CLEARML_FILES_HOST": "<files_host>",
        "CLEARML_API_ACCESS_KEY": "<api_access_key>",
        "CLEARML_API_SECRET_KEY": "<api_secret_key>"
    }
}
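If the credentials are already exported as environment variables, the fragment above can be assembled programmatically rather than typed by hand. The following is a minimal Python sketch, not part of the TAO API: the helper name build_docker_env_vars and its error handling are hypothetical, and only the five variable names are taken from the example above.

```python
import json
import os

# The five variable names come from the example above. The helper itself
# is a hypothetical illustration, not a TAO or ClearML function.
CLEARML_ENV_KEYS = (
    "CLEARML_WEB_HOST",
    "CLEARML_API_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def build_docker_env_vars(environ=None):
    """Collect the ClearML variables from an environment-like mapping."""
    environ = os.environ if environ is None else environ
    # Fail early if a variable is unset or empty, instead of sending a
    # request body with incomplete credentials.
    missing = [key for key in CLEARML_ENV_KEYS if not environ.get(key)]
    if missing:
        raise KeyError("Missing ClearML variables: " + ", ".join(missing))
    return {"docker_env_vars": {key: environ[key] for key in CLEARML_ENV_KEYS}}


# Demonstration with placeholder values. Call build_docker_env_vars()
# with no argument to read the real values from os.environ.
placeholders = {key: "<" + key.lower() + ">" for key in CLEARML_ENV_KEYS}
print(json.dumps(build_docker_env_vars(placeholders), indent=4))
```

The returned dictionary is meant to be merged into the create_experiment request body. Serializing it with json.dumps also avoids hand-written JSON mistakes such as a trailing comma after the last entry.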

Configure the ClearML Element in the Training Specification#

The TAO toolkit provides the following configuration options for ClearML:

  1. project: String specifying the project name where experiment data is uploaded

  2. tags: List of strings for experiment tagging

  3. deferred_init: Boolean to determine whether to wait for the experiment to be fully initialized

  4. continue_last_task: Boolean to resume execution from a previous experiment

  5. reuse_last_task_id: Forces new experiment creation with an existing task ID

  6. task: Names the experiment (TAO appends a timestamp to ensure unique names per run)

Add these configurations to the clearml element in your specs request body. For example:

{
    "specs": {
        "train": {
            "clearml": {
                "project": "tao_toolkit",
                "tags": ["training", "tao_toolkit"],
                "deferred_init": true,
                "continue_last_task": true,
                "task": "training_experiment_name"
            }
        }
    }
}
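The same structure can be generated from Python with basic type checks on the documented options. This is a sketch under stated assumptions: the helper build_clearml_spec is hypothetical, the False defaults for the two booleans are chosen here for illustration and are not documented TAO defaults, and reuse_last_task_id is left out because the example above does not set it.

```python
import json


def build_clearml_spec(project, task, tags=(), deferred_init=False,
                       continue_last_task=False):
    """Return a specs fragment with a clearml element under train.

    Hypothetical helper: the boolean defaults are assumptions made for
    this sketch, not values documented by TAO.
    """
    if not isinstance(project, str) or not project:
        raise ValueError("project must be a non-empty string")
    if not isinstance(task, str) or not task:
        raise ValueError("task must be a non-empty string")
    # A bare string would otherwise be split into single characters.
    if isinstance(tags, str):
        raise ValueError("tags must be a list of strings, not a string")
    tags = list(tags)
    if not all(isinstance(tag, str) for tag in tags):
        raise ValueError("tags must be a list of strings")
    clearml = {
        "project": project,
        "tags": tags,
        "deferred_init": bool(deferred_init),
        "continue_last_task": bool(continue_last_task),
        "task": task,
    }
    return {"specs": {"train": {"clearml": clearml}}}


# Rebuild the example shown above.
spec = build_clearml_spec(
    "tao_toolkit",
    "training_experiment_name",
    tags=["training", "tao_toolkit"],
    deferred_init=True,
    continue_last_task=True,
)
print(json.dumps(spec, indent=4))
```

json.dumps writes Python's True and False as the JSON literals true and false, so the output matches the form used in the request body.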

Visualization Output#

The following are sample images from a successful visualization run for DetectNet_v2.

../../_images/rich_media_images.png

Image showing intermediate inference images with bounding boxes before clustering using DBSCAN or NMS#

../../_images/system_utilization.png

Image showing system utilization plots.#

../../_images/metric_plots.png

Metrics plotted during training#

../../_images/logging.png

Streaming logs from the local machine running the training.#

../../_images/histograms.png

Weight histograms of the trained model.#