Connecting to Weights & Biases
Weights & Biases (W&B) is a widely used tool for charting training metrics from machine learning jobs, such as loss curves, resource usage, and accuracy scores. This makes it easy to see how a model learns over time and to compare multiple runs to find the best model for a given outcome.
W&B provides a simple Python API for sending training information to its servers. To use the API, create an access token on W&B, install the Python package, and tell W&B which values to track.
Setup
First, create a W&B access token. If you do not already have an account, navigate to https://wandb.ai and click Sign Up in the top right to create a free one. Once logged in, go to https://wandb.ai/settings and scroll to the bottom to create a new API key. This API key must be specified for every job that uses W&B.
Python Package Installation
The container used to run your job on DGX Cloud Lepton needs the W&B Python package installed. Some NGC images, like the NeMo Framework container (nvcr.io/nvidia/nemo), already have the package installed, while others, like the PyTorch image (nvcr.io/nvidia/pytorch), do not. If your container does not already have W&B installed, run this command as part of an entrypoint on container start or in a running container:
pip3 install wandb
You can check if your container already has W&B installed with:
pip3 freeze | grep wandb
If the above command returns nothing, W&B is not yet installed.
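The same check can also be made from Python, which may be convenient inside an entrypoint script. The following is a minimal sketch, not part of the W&B API: the helper name is_package_installed is hypothetical, and it relies only on the standard library's importlib.util.find_spec to see whether a top-level module can be located.

```python
import importlib.util


def is_package_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found.

    Hypothetical helper for illustration; it does not import the module.
    """
    return importlib.util.find_spec(module_name) is not None


if is_package_installed("wandb"):
    print("wandb is already installed")
else:
    print("wandb is not installed; run: pip3 install wandb")
```

This only confirms that the module can be located in the current Python environment; it does not verify the installed version.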
Example W&B Job
The following is a trivial example of a job that sends metrics to W&B using the API. The key points are:
- Import the wandb module
- Initialize the wandb project with wandb.init and specify hyperparameters and the project name
- Tell W&B which values to send with wandb.log()
import wandb
import random

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="my-awesome-project",
    # track hyperparameters and run metadata
    config={
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    }
)

# simulate training
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset
    # log metrics to wandb
    wandb.log({"acc": acc, "loss": loss})

# [optional] finish the wandb run, necessary in notebooks
wandb.finish()
To authenticate with W&B, set the WANDB_API_KEY environment variable to the API key you created earlier:
export WANDB_API_KEY=xxxxxxxxx
You can also set this environment variable directly in the platform when defining the job.
After running the example code, you should see a new project called my-awesome-project in your W&B account.
In your own W&B experiments, setting the API key automates the login step, so your code runs already connected to your account.
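Because a missing key is easy to overlook when a job is defined on the platform, it can help to check for it before training starts. The following is a minimal sketch under stated assumptions: the helper name wandb_api_key_is_set is hypothetical and is not part of the W&B package; it only inspects the WANDB_API_KEY environment variable described above.

```python
import os


def wandb_api_key_is_set(environ=None) -> bool:
    """Return True if WANDB_API_KEY is present and not blank.

    Hypothetical helper for illustration. Pass a dict to check something
    other than the real process environment.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("WANDB_API_KEY", "").strip())


if wandb_api_key_is_set():
    print("WANDB_API_KEY is set")
else:
    print("WANDB_API_KEY is not set; export it or define it on the job")
```

This check only confirms that the variable is present; it does not validate the key against W&B.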
Integration with NVIDIA NeMo Framework
NVIDIA NeMo Framework supports W&B natively. To use W&B with NeMo Framework, set your W&B key as an environment variable named WANDB_API_KEY. Refer to the documentation on integrating W&B for your specific NeMo Framework job.