Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Monitoring and Logging
The training code can log model- and system-related metrics to both TensorBoard and Weights & Biases (W&B). The local files are stored in the directory specified by the training.exp_manager.explicit_log_dir parameter. TensorBoard logs are saved by default.
W&B logging, however, requires an API key. To upload the logs to W&B, first store the API key on the first (normally the only) line of a text file, then set the wandb_api_key_file parameter to the file's pathname. For Base Command Platform, you can store this file in a dataset or workspace mounted for the job.
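As a minimal sketch of preparing such a key file, the Python snippet below writes the key as the single line of a text file and reads it back the same way a consumer of the first line would. The helper names write_wandb_key_file and read_wandb_key_file are hypothetical and are not part of NeMo or W&B. The owner-only file permission is an added precaution, not a documented requirement.

```python
import os
from pathlib import Path


def write_wandb_key_file(api_key: str, path: str) -> Path:
    """Write the API key as the first (and only) line of a text file.

    Hypothetical helper for illustration; not a NeMo or W&B function.
    """
    key_path = Path(path)
    key_path.write_text(api_key.strip() + "\n", encoding="utf-8")
    # Restrict access to the file owner, since the file holds a secret.
    os.chmod(key_path, 0o600)
    return key_path


def read_wandb_key_file(path: str) -> str:
    """Return the first line of the key file with surrounding whitespace removed."""
    with open(path, encoding="utf-8") as handle:
        return handle.readline().strip()
```

The resulting pathname is the value to give the wandb_api_key_file parameter. Keeping the key in a file rather than in the configuration itself keeps the secret out of saved configs and logs.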
You must set the following training configurations to enable logging of training metrics to W&B:
exp_manager:
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: [W&B project name]
    name: [W&B run name]
The logs show the reduced_train_loss, val_loss, and train_step_timing metrics, along with other relevant metrics. train_step_timing measures the time it takes to finish each global step.
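Because train_step_timing is a per-global-step time, it can be turned into rough throughput and run-time estimates. The sketch below assumes the metric is reported in seconds and that the global batch size and total step count are known from the training configuration. The function names and the sample numbers are illustrative only.

```python
def samples_per_second(global_batch_size: int, train_step_timing: float) -> float:
    """Estimate throughput from the time taken by one global step.

    Assumes train_step_timing is expressed in seconds.
    """
    if train_step_timing <= 0:
        raise ValueError("train_step_timing must be positive")
    return global_batch_size / train_step_timing


def estimated_hours(total_steps: int, train_step_timing: float) -> float:
    """Estimate total training time in hours at a constant step time."""
    return total_steps * train_step_timing / 3600.0


if __name__ == "__main__":
    # Illustrative values, not measurements from a real run.
    print(samples_per_second(global_batch_size=256, train_step_timing=2.0))
    print(estimated_hours(total_steps=1800, train_step_timing=2.0))
```

Step times usually fluctuate, so averaging train_step_timing over many steps before applying these formulas gives a steadier estimate.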