Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Monitoring and Logging

The training code can log model- and system-related metrics to both TensorBoard and Weights & Biases (W&B). Local log files are stored in the directory specified by the training.exp_manager.explicit_log_dir parameter. TensorBoard logs are saved by default.

W&B logging, in contrast, works only when a W&B API key is provided. To upload the logs to W&B, store the API key on the first (normally the only) line of a text file and set the wandb_api_key_file parameter to that file's path. For Base Command Platform, you can store this file in a dataset or workspace that is mounted for the job.
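The key file described above can be prepared with a few lines of Python. The following is a minimal sketch, not part of NeMo: the helper names write_wandb_key_file and read_wandb_key, the placeholder key, and the file location are all hypothetical. It writes the key as the only line of a text file, restricts the file to its owner because it holds a secret, and reads the first line back to confirm the layout.

```python
import stat
import tempfile
from pathlib import Path


def write_wandb_key_file(api_key: str, path: Path) -> Path:
    """Write the API key as the first (and only) line of a text file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(api_key.strip() + "\n", encoding="utf-8")
    # Restrict access to the file owner, since the file holds a secret.
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return path


def read_wandb_key(path: Path) -> str:
    """Return the first line of the key file without surrounding whitespace."""
    with path.open(encoding="utf-8") as f:
        return f.readline().strip()


# Hypothetical placeholder key and location; substitute your own values.
key_file = write_wandb_key_file(
    "0123456789abcdef",
    Path(tempfile.gettempdir()) / "wandb_api_key.txt",
)
print(f"Set wandb_api_key_file to: {key_file}")
```

The path printed at the end is the value to use for the wandb_api_key_file parameter. On a shared system, prefer a location that only your jobs can read over a world-readable temporary directory.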

To enable logging of training metrics to W&B, set the following training configuration:

exp_manager:
    create_wandb_logger: True
    wandb_logger_kwargs:
        project: [W&B project name]
        name: [W&B run name]

The logs include the reduced_train_loss, val_loss, and train_step_timing metrics, among others. train_step_timing measures the time it takes to finish each global step.
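Because train_step_timing is a per-step duration, it can be turned into a rough throughput figure. The sketch below is illustrative only and makes two assumptions that the text above does not state: that the logged values are in seconds, and that each global step processes exactly one global batch. The function name summarize_step_timing, the batch size, and the sample timings are hypothetical and are not measured results.

```python
from statistics import mean


def summarize_step_timing(step_times_s, global_batch_size, warmup_steps=0):
    """Return the mean step time and samples per second.

    The first `warmup_steps` values are dropped, since early steps are
    often slower than steady-state steps.
    """
    steady = list(step_times_s)[warmup_steps:]
    if not steady:
        raise ValueError("no step timings left after dropping warmup steps")
    mean_step_s = mean(steady)
    return {
        "mean_step_time_s": mean_step_s,
        # Assumes one global batch is processed per global step.
        "samples_per_second": global_batch_size / mean_step_s,
    }


# Made-up values for illustration only (assumed to be seconds per step).
timings = [4.0, 2.0, 2.0, 2.0]
summary = summarize_step_timing(timings, global_batch_size=256, warmup_steps=1)
print(summary)
```

Dropping a few warmup steps before averaging keeps one-time startup costs from skewing the estimate.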