Monitoring and Logging

The training code can log model- and system-related metrics to both TensorBoard and Weights & Biases (W&B). Local log files are stored in the directory specified by the training.exp_manager.explicit_log_dir parameter. TensorBoard logs are saved by default.

W&B logging, however, requires an API key. To upload logs to W&B, store the API key on the first (normally the only) line of a text file, then set the wandb_api_key_file parameter to that file's pathname. On Base Command Platform, you can store this file in a dataset or workspace that is mounted for the job.
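
The key-file convention described above can be sketched in Python. This is a minimal illustration, not the training code's own implementation: the helper names write_api_key_file and read_api_key, the file name, and the placeholder key are all hypothetical. It only shows a file whose first line holds the key, created with owner-only permissions.

```python
import os
import tempfile
from pathlib import Path


def write_api_key_file(path: Path, api_key: str) -> None:
    """Write the key as the single line of a text file (hypothetical helper)."""
    path.write_text(api_key.strip() + "\n", encoding="utf-8")
    os.chmod(path, 0o600)  # restrict access to the owner on POSIX systems


def read_api_key(path: Path) -> str:
    """Return the first line of the key file, without surrounding whitespace."""
    with path.open(encoding="utf-8") as handle:
        return handle.readline().strip()


if __name__ == "__main__":
    # Placeholder value for illustration only; never hard-code a real key.
    key_file = Path(tempfile.mkdtemp()) / "wandb_api_key.txt"
    write_api_key_file(key_file, "example-key-not-real")
    print(read_api_key(key_file))
```

The pathname of a file created this way is what you would pass as wandb_api_key_file.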

To enable logging of training metrics to W&B, set the following training configuration parameters:

exp_manager:
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: [W&B project name]
    name: [W&B run name]
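
If you prefer to pass these settings on the command line rather than edit the configuration file, the nested keys can be flattened into dotted key=value pairs. The sketch below assumes the launcher accepts dotted overrides under a training prefix, which is inferred only from the training.exp_manager.explicit_log_dir naming above. The to_overrides helper and the project and run names are hypothetical.

```python
from typing import Any, Dict, List


def to_overrides(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten a nested dict into dotted key=value strings (hypothetical helper)."""
    overrides: List[str] = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


# Mirrors the W&B settings shown above; the names are placeholders.
wandb_settings = {
    "exp_manager": {
        "create_wandb_logger": True,
        "wandb_logger_kwargs": {
            "project": "my-project",
            "name": "my-run",
        },
    }
}

for line in to_overrides(wandb_settings, prefix="training"):
    print(line)
```

Running the sketch prints training.exp_manager.create_wandb_logger=True followed by the two wandb_logger_kwargs entries, one per line.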

The logs show reduced_train_loss, val_loss, train_step_timing, and other relevant metrics. train_step_timing measures the time it takes to complete each global step.
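
Because train_step_timing is a per-step duration, it converts directly into throughput figures. The following sketch assumes the metric is reported in seconds and that you know the global batch size of the run. The summarize_step_timing function and the sample values are hypothetical and are not taken from any real log.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_step_timing(
    step_times_s: Sequence[float], global_batch_size: int
) -> Dict[str, float]:
    """Derive simple throughput figures from per-step timings (assumed seconds)."""
    avg = mean(step_times_s)
    return {
        "mean_step_time_s": avg,
        "steps_per_hour": 3600.0 / avg,
        "samples_per_second": global_batch_size / avg,
    }


if __name__ == "__main__":
    # Made-up timings for illustration: three global steps of 2 seconds each.
    summary = summarize_step_timing([2.0, 2.0, 2.0], global_batch_size=256)
    for name, value in summary.items():
        print(f"{name}: {value}")
```

With these made-up inputs, the mean step time is 2.0 seconds, which corresponds to 1800 steps per hour and 128 samples per second.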

© Copyright 2023-2024, NVIDIA. Last updated on Apr 25, 2024.