Logging and Monitoring#
This guide describes how to configure logging in Megatron Bridge. It introduces the high-level LoggerConfig, explains experiment logging to TensorBoard and Weights & Biases (W&B), and documents console logging behavior.
LoggerConfig Overview#
LoggerConfig is the dataclass that encapsulates logging-related settings for training. It resides inside the overall bridge.training.config.ConfigContainer, which represents the complete configuration for a training run.
Timer Configuration Options#
Use the following options to control which timing metrics are collected during training and how they are aggregated and logged.
timing_log_level#
Controls which timers are recorded during execution:
Level 0: Logs only the overall iteration time.
Level 1: Includes once-per-iteration operations, such as gradient all-reduce.
Level 2: Captures frequently executed operations, providing more detailed insights but with increased overhead.
timing_log_option#
Specifies how timer values are aggregated across ranks. Valid options:
"max": Logs the maximum value across ranks.
"minmax": Logs both minimum and maximum values.
"all": Logs all values from all ranks.
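The three aggregation modes can be illustrated with a small pure-Python sketch. It is not the framework's implementation: the function name aggregate_timer, the returned dictionary layout, and the sample timings are all hypothetical, chosen only to show what each option reports.

```python
from typing import Dict, List


def aggregate_timer(per_rank_ms: List[float], option: str) -> Dict[str, object]:
    """Summarize one timer's per-rank values according to the chosen option."""
    if not per_rank_ms:
        raise ValueError("per_rank_ms must contain at least one value")
    if option == "max":
        return {"max": max(per_rank_ms)}
    if option == "minmax":
        return {"min": min(per_rank_ms), "max": max(per_rank_ms)}
    if option == "all":
        # One entry per rank, in rank order.
        return {"all": list(per_rank_ms)}
    raise ValueError(f"unknown timing_log_option: {option!r}")


# Hypothetical forward-pass times (ms) gathered from four ranks.
forward_ms = [12.0, 15.5, 11.8, 14.1]
summary_max = aggregate_timer(forward_ms, "max")
summary_minmax = aggregate_timer(forward_ms, "minmax")
summary_all = aggregate_timer(forward_ms, "all")
```

A large gap between the minimum and maximum in "minmax" mode is a quick hint that some ranks are slower than others, while "all" is the most verbose choice and grows with the number of ranks.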
log_timers_to_tensorboard#
When enabled, the framework records timer metrics to supported backends such as TensorBoard.
Diagnostic Options#
The framework provides several optional toggles for enhanced monitoring and diagnostics:
Loss Scale: Logs the loss scale used in mixed-precision training.
Validation Perplexity: Tracks model perplexity during validation.
CUDA Memory Statistics: Reports detailed GPU memory usage.
World Size: Displays the total number of distributed ranks.
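As background for the validation perplexity toggle, perplexity is conventionally the exponential of the mean cross-entropy loss. The sketch below shows only that relationship. It assumes the loss is measured in nats per token, and whether the framework clamps the loss or applies other adjustments before exponentiating is not covered here. The function name and sample value are hypothetical.

```python
import math


def perplexity(mean_loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (nats per token) to perplexity."""
    return math.exp(mean_loss_nats)


# Hypothetical validation loss: ln(8), which corresponds to a perplexity of about 8.
val_loss = math.log(8.0)
val_ppl = perplexity(val_loss)
```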
Logging Options#
Use the following options to enable additional diagnostics and performance monitoring during training.
log_params_norm: Computes and logs the L2 norm of model parameters. If available, it also logs the gradient norm.
log_energy: Activates the energy monitor, which records per-GPU energy consumption and instantaneous power usage.
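To make the parameter-norm metric concrete, the following sketch computes a single L2 norm over several parameter tensors by treating them as one flat vector. It uses plain Python lists instead of real tensors and ignores distributed sharding, so it illustrates the quantity rather than the framework's code path. The name global_l2_norm is hypothetical.

```python
import math
from typing import Iterable, Sequence


def global_l2_norm(tensors: Iterable[Sequence[float]]) -> float:
    """L2 norm over all values of all tensors, treated as one flat vector."""
    total_sq = 0.0
    for values in tensors:
        total_sq += sum(v * v for v in values)
    return math.sqrt(total_sq)


# Two hypothetical parameter tensors, flattened to lists of floats.
params = [[3.0, 0.0], [4.0]]
params_norm = global_l2_norm(params)
```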
Experiment Logging#
Both TensorBoard and W&B are supported for metric logging. When using W&B, it’s recommended to also enable TensorBoard to ensure that all scalar metrics are consistently logged across backends.
TensorBoard#
What Gets Logged#
TensorBoard captures a range of training and system metrics, including:
Learning rate, including decoupled LR when applicable
Per-loss scalars for detailed breakdowns
Batch size and loss scale
CUDA memory usage and world size (if enabled)
Validation loss, with optional perplexity
Timers, when timing is enabled
Energy consumption and instantaneous power, if energy logging is active
Enable TensorBoard Logging#
Install TensorBoard (if not already available):
pip install tensorboard
Configure logging in your training setup. In these examples, cfg refers to a ConfigContainer instance (such as one produced by a recipe), which contains a logger attribute representing the LoggerConfig:
from megatron.bridge.training.config import LoggerConfig

cfg.logger = LoggerConfig(
    tensorboard_dir="./runs/tensorboard",
    tensorboard_log_interval=10,
    log_timers_to_tensorboard=True,  # optional
    log_memory_to_tensorboard=False,  # optional
)
Note
The writer is created lazily on the last rank when tensorboard_dir is set.
Set the Output Directory#
TensorBoard event files are saved to the directory specified by tensorboard_dir.
Example with additional metrics enabled:
cfg.logger.tensorboard_dir = "./logs/tb"
cfg.logger.tensorboard_log_interval = 5
cfg.logger.log_loss_scale_to_tensorboard = True
cfg.logger.log_validation_ppl_to_tensorboard = True
cfg.logger.log_world_size_to_tensorboard = True
cfg.logger.log_timers_to_tensorboard = True
Weights & Biases (W&B)#
What Gets Logged#
When enabled, W&B automatically mirrors the scalar metrics logged to TensorBoard.
In addition, the full run configuration is synced at initialization, allowing for reproducibility and experiment tracking.
Enable W&B Logging#
Install W&B (if not already available):
pip install wandb
Authenticate with W&B using one of the following methods:
Set WANDB_API_KEY in the environment before the run, or
Run wandb login once on the machine.
Configure logging in your training setup. In these examples, cfg refers to a ConfigContainer instance (such as one produced by a recipe), which contains a logger attribute representing the LoggerConfig:
from megatron.bridge.training.config import LoggerConfig

cfg.logger = LoggerConfig(
    tensorboard_dir="./runs/tensorboard",  # recommended: enables shared logging gate
    wandb_project="my_project",
    wandb_exp_name="my_experiment",
    wandb_entity="my_team",  # optional
    wandb_save_dir="./runs/wandb",  # optional
)
Note
W&B is initialized lazily on the last rank when wandb_project is set and wandb_exp_name is non-empty.
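The condition in the note can be written out as a small predicate. This is a sketch of the rule as stated, not the framework's code: the function should_init_wandb is hypothetical, and it assumes that the last rank is world_size - 1 and that an empty wandb_project counts as unset.

```python
from typing import Optional


def should_init_wandb(
    rank: int,
    world_size: int,
    wandb_project: Optional[str],
    wandb_exp_name: Optional[str],
) -> bool:
    """Return True only on the last rank, with a project set and a non-empty name."""
    is_last_rank = rank == world_size - 1
    # Assumption: None and "" are both treated as "not set".
    return is_last_rank and bool(wandb_project) and bool(wandb_exp_name)


# Hypothetical 8-rank job: only rank 7 would create the W&B run.
init_on_rank0 = should_init_wandb(0, 8, "my_project", "my_experiment")
init_on_rank7 = should_init_wandb(7, 8, "my_project", "my_experiment")
```

If no run appears in the W&B UI, checking both values on the last rank is a reasonable first step, because a missing experiment name alone is enough to skip initialization.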
W&B Configuration with NeMo Run Launching#
For users launching training scripts with NeMo Run, W&B can be optionally configured using the bridge.recipes.run_plugins.WandbPlugin.
The plugin automatically forwards the WANDB_API_KEY and by default injects CLI overrides for the following logger parameters:
logger.wandb_project
logger.wandb_entity
logger.wandb_exp_name
logger.wandb_save_dir
This allows seamless integration of W&B logging into your training workflow without manual configuration.
Progress Log#
When logger.log_progress is enabled, the framework generates a progress.txt file in the checkpoint save directory.
This file includes:
Job-level metadata, such as timestamp and GPU count
Periodic progress entries throughout training
At each checkpoint boundary, the log is updated with:
Job throughput (TFLOP/s/GPU)
Cumulative throughput
Total floating-point operations
Tokens processed
This provides a lightweight, text-based audit trail of training progress, useful for tracking performance across restarts.
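For readers interpreting the throughput figures, the sketch below shows the unit conversion behind a TFLOP/s/GPU value: total floating-point operations divided by elapsed time, by GPU count, and by 10^12. The exact formulas and the line format used in progress.txt are not specified here, so the function name and the sample numbers are hypothetical.

```python
def tflops_per_gpu(total_flops: float, elapsed_seconds: float, num_gpus: int) -> float:
    """Average throughput per GPU, in TFLOP/s."""
    if elapsed_seconds <= 0 or num_gpus <= 0:
        raise ValueError("elapsed_seconds and num_gpus must be positive")
    return total_flops / elapsed_seconds / num_gpus / 1e12


# Hypothetical job: 8e15 floating-point operations in 10 seconds on 8 GPUs.
job_throughput = tflops_per_gpu(8e15, 10.0, 8)
```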
Console Logging#
Megatron Bridge uses the standard Python logging subsystem for console output.
Configure Console Logging#
To control console logging behavior, use the following configuration options:
logging_level sets the default verbosity level. It can be overridden via the MEGATRON_BRIDGE_LOGGING_LEVEL environment variable.
filter_warnings suppresses messages at the WARNING level.
modules_to_filter specifies logger name prefixes to exclude from output.
set_level_for_all_loggers determines whether the logging level is applied to all loggers or only a subset, depending on the current implementation.
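Because console output goes through the standard Python logging module, the effect of prefix filtering and of an environment-variable override can be sketched with the standard library alone. This is an illustration, not the framework's implementation: PrefixFilter and resolve_level are hypothetical names, "noisy_lib" is a made-up prefix, and the sketch assumes the environment variable holds a numeric level such as 10 or 20.

```python
import logging
import os
from typing import Sequence


class PrefixFilter(logging.Filter):
    """Drop records whose logger name starts with any of the given prefixes."""

    def __init__(self, prefixes: Sequence[str]) -> None:
        super().__init__()
        self.prefixes = tuple(prefixes)

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logging module to discard the record.
        return not record.name.startswith(self.prefixes)


def resolve_level(configured_level: int) -> int:
    """Let the environment variable, if set, override the configured level."""
    # Assumption: the variable contains a numeric level (for example "10").
    raw = os.environ.get("MEGATRON_BRIDGE_LOGGING_LEVEL")
    return int(raw) if raw is not None else configured_level


# Attach the filter to a console handler; "noisy_lib" is a hypothetical prefix.
handler = logging.StreamHandler()
handler.addFilter(PrefixFilter(["noisy_lib"]))
handler.setLevel(resolve_level(logging.INFO))
```

Filtering on the handler rather than on individual loggers means the rule applies to every record that reaches the console, regardless of which logger emitted it.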
Monitor Logging Cadence and Content#
To monitor training progress at regular intervals, the framework prints a summary line every log_interval iterations.
Each summary includes:
Timestamp
Iteration counters
Consumed and skipped samples
Iteration time (ms)
Learning rates
Global batch size
Per-loss averages
Loss scale
When enabled, additional metrics are printed:
Gradient norm
Zeros in gradients
Parameter norm
Energy and power per GPU
Straggler timing reports follow the same log_interval cadence, helping identify performance bottlenecks across ranks.
Minimize Timing Overhead#
To reduce performance impact, set timing_log_level to 0.
Increase to 1 or 2 only when more detailed timing metrics are required, as higher levels introduce additional logging overhead.