TAO WandB Integration#

All networks in TAO integrate with Weights & Biases, so you can iterate continuously, visualize and track multiple training experiments, and compile insights into your training use case.

The Weights & Biases visualization suite synchronizes with the data rendered in TensorBoard. Therefore, to see rendered data on the Weights & Biases server, you need to enable TensorBoard visualization. The integration can also send you alerts via Slack or email when a training run fails.

Note

Enabling MLOps integration does not require you to install TensorBoard.

Quick Start#

Setting up Weights & Biases for TAO involves the following broad steps:

Set up a wandb account
Obtain your wandb API key
Log in to wandb
Configure the wandb element in the training specification

Set Up a wandb Account#

Sign up for a free account at wandb and then log in.

../../_images/wandb_login.png — Wandb login screen#

Acquire a wandb API Key#

After you log in to your Weights & Biases account, find your API key on the credentials page.

../../_images/wandb_credentials.png — Wandb credentials page#

Log In to wandb#

To send data from the compute unit running the training job and render it on the Weights & Biases dashboard, the wandb client running inside the TAO container must be logged in. Export your API key in the shell that launches your TAO agent:

export WANDB_API_KEY="<api_key_value>"
export WANDB_PROJECT="<your-project-name>"

The TAO Execution SDK forwards these variables into the training container at dispatch time. If you change the key, exit and restart the agent in the same shell so the new value is picked up.
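Before launching the agent, it can help to confirm that both variables are visible in the environment. The following is a minimal sketch in Python and is not part of TAO: the helper name `missing_wandb_vars` is hypothetical, and it only checks that the variables are set and non-empty. It does not contact the Weights & Biases server or verify that the key is valid.

```python
import os
from typing import List, Mapping

# The two variables exported in the step above.
REQUIRED_VARS = ("WANDB_API_KEY", "WANDB_PROJECT")


def missing_wandb_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank.

    Hypothetical helper for illustration only; it does not validate the key.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {"WANDB_API_KEY": "<api_key_value>", "WANDB_PROJECT": ""}
print(missing_wandb_vars(example_env))  # prints ['WANDB_PROJECT']
```

Calling `missing_wandb_vars()` with no argument inspects the environment of the current process, so run it from the same shell that launches the agent.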

Configure the wandb Element in the Training Specification#

The following configuration options are available for the wandb client:

enabled: Boolean that enables the wandb client
project: String specifying the project name where experiment data is uploaded
entity: String specifying the entity (group) under which the project is created
tags: List of strings for experiment tagging
reinit: Boolean that allows runs to be reinitialized
sync_tensorboard: Boolean that enables TensorBoard log synchronization
save_code: Boolean that saves main scripts or notebooks to wandb for reproducibility
name: Short display name for the run; TAO appends a timestamp to maintain uniqueness

Add these to the wandb element of your training specification. For example:

wandb:
  enabled: true
  project: tao_toolkit
  entity: tao_toolkit
  tags: [training, tao_toolkit]
  reinit: true
  sync_tensorboard: true
  save_code: false
  name: training_experiment_name

When you ask the agent to run a job, it resolves these spec keys for the model you chose and submits the dispatch through the TAO Execution SDK.
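A quick type check of the wandb element can catch typos before a job is dispatched. The sketch below is an illustration in Python, not TAO's own validation: the function name `check_wandb_spec` is hypothetical, the expected types are inferred from the option list in this section, and the spec is written as the dictionary a YAML loader would typically produce.

```python
from typing import Any, Dict, List

# Expected Python types, inferred from the option descriptions in this section.
WANDB_OPTION_TYPES = {
    "enabled": bool,
    "project": str,
    "entity": str,
    "tags": list,
    "reinit": bool,
    "sync_tensorboard": bool,
    "save_code": bool,
    "name": str,
}


def check_wandb_spec(wandb_spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a wandb spec element (hypothetical helper)."""
    problems = []
    for key, value in wandb_spec.items():
        expected = WANDB_OPTION_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown option: {key}")
        elif not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif key == "tags" and not all(isinstance(tag, str) for tag in value):
            problems.append("tags: every tag must be a string")
    return problems


# The example spec from this section, expressed as a dictionary.
spec = {
    "enabled": True,
    "project": "tao_toolkit",
    "entity": "tao_toolkit",
    "tags": ["training", "tao_toolkit"],
    "reinit": True,
    "sync_tensorboard": True,
    "save_code": False,
    "name": "training_experiment_name",
}
print(check_wandb_spec(spec))  # prints []
```

An empty list means every key is one of the listed options and has the expected type. The check says nothing about whether the project or entity exists on the Weights & Biases server.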

Visualization Output#

The following sample images come from a successful Grounding DINO visualization run.

../../_images/rich_media_images1.png — Sample images rendered with bounding boxes#

../../_images/system_utilization1.png — System utilization plots#

../../_images/experiment_config.png — Experiment configuration saved for the record#

../../_images/metric_plots1.png — Metrics plotted during training#

../../_images/logging1.png — Logs streamed from the local machine running the training#