TAO ClearML Integration#

TAO’s TensorFlow 2 networks integrate with ClearML, enabling you to continuously iterate, visualize and track multiple training experiments, and compile meaningful insights into a training use case.

The ClearML visualization suite synchronizes with the data rendered in TensorBoard. Therefore, to see rendered data on the ClearML server, you need to enable TensorBoard visualization. The integration can also send you alerts via Slack or email for training runs that have failed.

Note

Enabling MLOps integration does not require you to install TensorBoard.

Quick Start#

These are the broad steps involved with setting up ClearML for TAO:

  1. Set up a ClearML account

  2. Acquire ClearML credentials

  3. Log into the ClearML client

  4. Configure experiment spec for ClearML

Set up a ClearML Account#

Sign up for a free account at the ClearML website and then log in to your ClearML account.

Acquire ClearML API Credentials#

  1. Log in to your ClearML account

  2. Navigate to the Configuration Page

  3. Click on Create new credentials to generate API keys

Note

NVIDIA recommends getting the credentials in the form of environment variables for maximum ease of use. You can get these variables by clicking on the Jupyter Notebook tab and copying the environment variables shown there.

../../_images/clearml_credentials.png

Jupyter notebook tab from the credentials under Settings/Workspace#

Log in to the ClearML Client#

To send data from your local compute unit and display it on the ClearML server dashboard, log in to the ClearML client within the TAO Finetuning Microservice and synchronize it with your profile. Add the following parameters to the docker_env_vars dictionary in your create_experiment request body.

For example:

{
    "docker_env_vars": {
        "CLEARML_WEB_HOST": "<web_host>",
        "CLEARML_API_HOST": "<api_host>",
        "CLEARML_FILES_HOST": "<files_host>",
        "CLEARML_API_ACCESS_KEY": "<api_access_key>",
        "CLEARML_API_SECRET_KEY": "<api_secret_key>"
    }
}
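If the credentials are already exported in your shell or notebook, the request-body fragment can be assembled programmatically rather than typed by hand. The following is a minimal sketch under that assumption. The helper name `build_docker_env_vars` is hypothetical and is not part of TAO or ClearML; only the five variable names and the `docker_env_vars` key come from the example above.

```python
import json
import os

# Variable names taken from the example above.
CLEARML_VARS = (
    "CLEARML_WEB_HOST",
    "CLEARML_API_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def build_docker_env_vars(environ=None):
    """Hypothetical helper: return {"docker_env_vars": {...}} from a mapping.

    Reads from os.environ when no mapping is passed and raises KeyError if
    any of the five ClearML variables is missing or empty.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in CLEARML_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing ClearML variables: " + ", ".join(missing))
    return {"docker_env_vars": {name: environ[name] for name in CLEARML_VARS}}


# Placeholder values only; substitute the credentials from your account.
example_env = {name: "<" + name.lower() + ">" for name in CLEARML_VARS}
fragment = build_docker_env_vars(example_env)
print(json.dumps(fragment, indent=4))
```

Serializing with `json.dumps` also guards against hand-editing mistakes such as a trailing comma, which strict JSON parsers reject.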

Configure the ClearML Element in the Training Specification#

The TAO toolkit provides the following configuration options for ClearML:

  1. project: String specifying the project name where experiment data is uploaded

  2. tags: List of strings for experiment tagging

  3. deferred_init: Boolean to determine whether to wait for the experiment to be fully initialized

  4. continue_last_task: Boolean to resume execution from a previous experiment

  5. reuse_last_task_id: Forces new experiment creation with an existing task ID

  6. task: Names the experiment (TAO appends a timestamp to ensure unique names per run)

Add these configurations to the clearml element in your specs request body. For example:

{
    "specs": {
        "train": {
            "clearml": {
                "project": "tao_toolkit",
                "tags": ["training", "tao_toolkit"],
                "deferred_init": true,
                "continue_last_task": true,
                "task": "training_experiment_name"
            }
        }
    }
}
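A small script can catch misspelled option names before the request is sent. This sketch assumes the `specs` > `train` > `clearml` nesting shown in the example above and checks keys against the six documented options. The helper `build_clearml_specs` is hypothetical and is not part of TAO.

```python
import json

# The six option names documented above.
ALLOWED_OPTIONS = {
    "project",
    "tags",
    "deferred_init",
    "continue_last_task",
    "reuse_last_task_id",
    "task",
}


def build_clearml_specs(**options):
    """Hypothetical helper: nest ClearML options under specs/train/clearml.

    Raises ValueError if an option name is not one of the documented six.
    """
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError("Unknown ClearML options: " + ", ".join(sorted(unknown)))
    return {"specs": {"train": {"clearml": dict(options)}}}


body = build_clearml_specs(
    project="tao_toolkit",
    tags=["training", "tao_toolkit"],
    deferred_init=True,
    continue_last_task=True,
    task="training_experiment_name",
)
print(json.dumps(body, indent=4))
```

Python's `True` serializes to JSON `true`, so the printed output has the same shape as the example above.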

To send data from the local compute unit and render it on the ClearML server dashboard, the ClearML client in the TAO container must be logged in and synchronized with your profile. To log the client in, set the following environment variables to the values you received when you created your ClearML credentials.

%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
%env CLEARML_API_ACCESS_KEY=<API_ACCESS_KEY>
%env CLEARML_API_SECRET_KEY=<API_SECRET_KEY>

To set the environment variables via the TAO launcher, use the sample JSON file below for reference and replace the value fields under the Envs element of the ~/.tao_mounts.json file.

{
    "Mounts": [
        {
            "source": "/path/to/your/data",
            "destination": "/workspace/tao-experiments/data"
        },
        {
            "source": "/path/to/your/local/results",
            "destination": "/workspace/tao-experiments/results"
        },
        {
            "source": "/path/to/config/files",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CLEARML_WEB_HOST",
            "value": "https://app.clear.ml"
        },
        {
            "variable": "CLEARML_API_HOST",
            "value": "https://api.clear.ml"
        },
        {
            "variable": "CLEARML_FILES_HOST",
            "value": "https://files.clear.ml"
        },
        {
            "variable": "CLEARML_API_ACCESS_KEY",
            "value": "<API_ACCESS_KEY>"
        },
        {
            "variable": "CLEARML_API_SECRET_KEY",
            "value": "<API_SECRET_KEY>"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
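Editing the Envs list by hand is error-prone when the file already contains other entries. The following is a minimal sketch of a script that inserts or updates variables while leaving Mounts and DockerOptions untouched. It assumes the file layout shown above. The function `set_launcher_envs` is hypothetical and is not part of the TAO launcher.

```python
import json
from pathlib import Path


def set_launcher_envs(mounts_file, values):
    """Hypothetical helper: insert or update "Envs" entries in a mounts file.

    `values` maps variable names to values. Existing entries with the same
    "variable" name are updated in place; new ones are appended.
    """
    path = Path(mounts_file)
    config = json.loads(path.read_text()) if path.exists() else {}
    envs = config.setdefault("Envs", [])
    by_name = {entry["variable"]: entry for entry in envs}
    for name, value in values.items():
        if name in by_name:
            by_name[name]["value"] = value
        else:
            envs.append({"variable": name, "value": value})
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

A call such as `set_launcher_envs(Path.home() / ".tao_mounts.json", {"CLEARML_API_ACCESS_KEY": "<API_ACCESS_KEY>"})` would rewrite the launcher file in place, so consider backing it up first.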

Note

When running the networks from TAO containers directly, pass the variables with the -e flag of the docker command. For example, to run classification_tf2 with ClearML directly via the container, use the following command.

docker run -it --rm --gpus all \
        -v /path/in/host:/path/in/docker \
        -e CLEARML_WEB_HOST="https://app.clear.ml" \
        -e CLEARML_API_HOST="https://api.clear.ml" \
        -e CLEARML_FILES_HOST="https://files.clear.ml" \
        -e CLEARML_API_ACCESS_KEY="<API_ACCESS_KEY>" \
        -e CLEARML_API_SECRET_KEY="<API_SECRET_KEY>" \
        nvcr.io/nvidia/tao/tao-toolkit:6.0.0-tf2 \
        classification_tf2 train -e /path/to/experiment/spec.yaml

Configure the ClearML Element in the Training Spec#

TAO provides the following options to configure the ClearML client:

  1. project: String specifying the project name where experiment data is uploaded

  2. tags: List of strings for experiment tagging

  3. deferred_init: Boolean to determine whether to wait for the experiment to be fully initialized

  4. continue_last_task: Boolean to resume execution from a previous experiment

  5. reuse_last_task_id: Forces new experiment creation with an existing task ID

  6. task: Names the experiment (TAO appends a timestamp to ensure unique names per run)

For EfficientDet-TF2 and Classification-TF2, add the following snippet under the train config element in the train.yaml file.

clearml:
    task: "name_of_the_experiment"
    project: "name_of_the_project"

Visualization Output#

The following are sample images from a successful visualization run for DetectNet_v2.

../../_images/rich_media_images.png

Image showing intermediate inference images with bounding boxes before clustering using DBScan or NMS#

../../_images/system_utilization.png

Image showing system utilization plots.#

../../_images/metric_plots.png

Metrics plotted during training#

../../_images/logging.png

Streaming logs from the local machine running the training.#

../../_images/histograms.png

Weight histograms of the trained model.#