TAO WandB Integration#

All networks in TAO support integration with Weights & Biases to help you continuously iterate, visualize and track multiple training experiments, and compile meaningful insights into the training use case.

The Weights & Biases visualization suite synchronizes with the data rendered in TensorBoard. Therefore, to see rendered data on the Weights & Biases server, you need to enable TensorBoard visualization. The integration can also send you alerts through Slack or email when a training run fails.

Note

Enabling MLOps integration does not require you to install TensorBoard.

Quick Start#

These are the broad steps involved with setting up Weights & Biases for TAO:

  1. Set up a wandb account

  2. Obtain your wandb API key

  3. Log in to wandb

  4. Configure the wandb element in the training specification

Set Up a wandb Account#

Sign up for a free account at wandb and then log in.

../../_images/wandb_login.png

Wandb login screen#

Acquire a wandb API Key#

After you log in to your Weights & Biases account, find your API key here.

../../_images/wandb_credentials.png

Wandb credentials page#

Log In to wandb#

Log In to wandb Through the TAO Finetuning Microservice

To send data from the local compute unit and render it on the Weights & Biases server dashboard, the wandb client in the TAO container must be logged in and synchronized with your profile.

To log in to wandb when running TAO Finetuning Microservice (FTMS) jobs, add the WANDB_API_KEY to the docker_env_vars dictionary in the create_experiment request body. For example:

{
    "docker_env_vars": {
        "WANDB_API_KEY": "<api_key_value>"
    }
}
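If you assemble the request body in a script, it can help to read the key from the environment rather than hard-coding it. The following is a minimal sketch in Python, not part of the FTMS API: the helper name build_create_experiment_body is hypothetical, and only the docker_env_vars and WANDB_API_KEY names come from this page. How the body is sent to FTMS is left out on purpose.

```python
import json
import os
from typing import Dict, Optional


def build_create_experiment_body(
    api_key: str, extra_env: Optional[Dict[str, str]] = None
) -> dict:
    """Return a request-body fragment that carries the wandb API key.

    Hypothetical helper for illustration; only the "docker_env_vars" and
    "WANDB_API_KEY" names are taken from the documentation.
    """
    if not api_key or not api_key.strip():
        raise ValueError("WANDB_API_KEY is empty")
    # Copy any other container environment variables, then add the key.
    env = dict(extra_env or {})
    env["WANDB_API_KEY"] = api_key.strip()
    return {"docker_env_vars": env}


if __name__ == "__main__":
    # Falls back to the placeholder from the example above if the
    # environment variable is not set.
    key = os.environ.get("WANDB_API_KEY", "<api_key_value>")
    print(json.dumps(build_create_experiment_body(key), indent=4))
```

Merge the returned dictionary into the rest of your create_experiment request body before submitting it.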

Configure the wandb Element in the Training Specification

FTMS provides the following configuration options for the wandb client:

  1. enabled: Boolean that enables the wandb client

  2. project: String specifying the project name where experiment data is uploaded

  3. entity: String specifying the entity (group) under which the project is created

  4. tags: List of strings for experiment tagging

  5. reinit: Boolean allowing reinitializing of runs

  6. sync_tensorboard: Boolean enabling TensorBoard log synchronization

  7. save_code: Boolean saving main scripts or notebooks to wandb for reproducibility

  8. name: Short display name for the run. TAO appends a timestamp to maintain uniqueness.

Add these configurations to the wandb element in the training specs request body. For example:

{
    "specs": {
        "wandb": {
            "enabled": true,
            "project": "tao_toolkit",
            "entity": "tao_toolkit",
            "tags": ["training", "tao_toolkit"],
            "reinit": true,
            "sync_tensorboard": true,
            "save_code": false,
            "name": "training_experiment_name"
        }
    }
}
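A mistyped field, such as the string "true" instead of a Boolean, is easy to miss in hand-written JSON. The sketch below builds the same specs fragment in Python and checks each option against the types listed above. The helper build_wandb_specs and the WANDB_FIELD_TYPES table are hypothetical names introduced for illustration; TAO does not ship them, and this check is an assumption about what is convenient on the client side, not a description of validation that FTMS performs.

```python
import json

# Option names and types as listed in this section.
WANDB_FIELD_TYPES = {
    "enabled": bool,
    "project": str,
    "entity": str,
    "tags": list,
    "reinit": bool,
    "sync_tensorboard": bool,
    "save_code": bool,
    "name": str,
}


def build_wandb_specs(**options) -> dict:
    """Return {"specs": {"wandb": {...}}} after basic type checks.

    Hypothetical client-side helper; not part of TAO or wandb.
    """
    for key, value in options.items():
        expected = WANDB_FIELD_TYPES.get(key)
        if expected is None:
            raise KeyError(f"unknown wandb option: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
        if key == "tags" and not all(isinstance(tag, str) for tag in value):
            raise TypeError("tags must be a list of strings")
    return {"specs": {"wandb": dict(options)}}


if __name__ == "__main__":
    body = build_wandb_specs(
        enabled=True,
        project="tao_toolkit",
        entity="tao_toolkit",
        tags=["training", "tao_toolkit"],
        reinit=True,
        sync_tensorboard=True,
        save_code=False,
        name="training_experiment_name",
    )
    print(json.dumps(body, indent=4))
```

Serializing with json.dumps converts the Python Booleans to the lowercase true and false that the JSON example above uses.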

Visualization Output#

The following sample images come from a successful Grounding DINO visualization run.

../../_images/rich_media_images1.png

Sample images rendered with bounding boxes#

../../_images/system_utilization1.png

System utilization plots#

../../_images/experiment_config.png

Experiment configuration saved for record keeping#

../../_images/metric_plots1.png

Metrics plotted during training#

../../_images/logging1.png

Streaming logs from the local machine running the training#