Logging and Monitoring#
This guide describes how to configure logging in Megatron Bridge. It introduces the high-level LoggerConfig, explains experiment logging to TensorBoard and Weights & Biases (W&B), and documents console logging behavior.
LoggerConfig Overview#
LoggerConfig is the dataclass that encapsulates logging-related settings for training. It resides inside the overall megatron.bridge.training.config.ConfigContainer, which represents the complete configuration for a training run.
Timer Configuration Options#
Use the following options to control which timing metrics are collected during training and how they are aggregated and logged.
timing_log_level#
Controls which timers are recorded during execution:
Level 0: Logs only the overall iteration time.
Level 1: Includes once-per-iteration operations, such as gradient all-reduce.
Level 2: Captures frequently executed operations, providing more detailed insights but with increased overhead.
timing_log_option#
Specifies how timer values are aggregated across ranks. Valid options:
"max": Logs the maximum value across ranks."minmax": Logs both minimum and maximum values."all": Logs all values from all ranks.
log_timers_to_tensorboard#
When enabled, the framework records timer metrics to supported backends such as TensorBoard.
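For example, a minimal sketch combining these settings (cfg is a ConfigContainer instance, as in the configuration examples later in this guide):
cfg.logger.timing_log_level = 1          # 0 = iteration time only; 2 = most detail, most overhead
cfg.logger.timing_log_option = "minmax"  # aggregate timers as min and max across ranks
cfg.logger.log_timers_to_tensorboard = True  # also write timer metrics to TensorBoard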
Diagnostic Options#
The framework provides several optional toggles for enhanced monitoring and diagnostics:
Loss Scale: Enables dynamic loss scaling for mixed-precision training.
Validation Perplexity: Tracks model perplexity during validation.
CUDA Memory Statistics: Reports detailed GPU memory usage.
World Size: Displays the total number of distributed ranks.
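These toggles correspond to the log_*_to_tensorboard fields used in the TensorBoard configuration examples later in this guide; a minimal sketch:
cfg.logger.log_loss_scale_to_tensorboard = True      # loss scale for mixed precision
cfg.logger.log_validation_ppl_to_tensorboard = True  # validation perplexity
cfg.logger.log_memory_to_tensorboard = True          # CUDA memory statistics
cfg.logger.log_world_size_to_tensorboard = True      # number of distributed ranks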
Logging Options#
Use the following options to enable additional diagnostics and performance monitoring during training.
log_params_norm: Computes and logs the L2 norm of model parameters. If available, it also logs the gradient norm.
log_energy: Activates the energy monitor, which records per-GPU energy consumption and instantaneous power usage.
log_memory: Logs the memory usage of the model from torch.cuda.memory_stats().
log_throughput_to_tensorboard: Calculates training throughput and utilization.
log_runtime_to_tensorboard: Estimates the total time remaining until the end of training.
log_l2_norm_grad_to_tensorboard: Computes and logs the L2 norm of gradients for each model layer.
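As a sketch, enabling a few of these options (field names as listed above):
cfg.logger.log_params_norm = True                # parameter L2 norm (plus grad norm if available)
cfg.logger.log_energy = True                     # per-GPU energy and instantaneous power
cfg.logger.log_throughput_to_tensorboard = True  # training throughput and utilization
cfg.logger.log_runtime_to_tensorboard = True     # estimated time remaining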
Experiment Logging#
Both TensorBoard and W&B are supported for metric logging. When using W&B, it’s recommended to also enable TensorBoard to ensure that all scalar metrics are consistently logged across backends.
TensorBoard#
What Gets Logged#
TensorBoard captures a range of training and system metrics, including:
Learning rate, including decoupled LR when applicable
Per-loss scalars for detailed breakdowns
Batch size and loss scale
CUDA memory usage and world size (if enabled)
Validation loss, with optional perplexity
Timers, when timing is enabled
Energy consumption and instantaneous power, if energy logging is active
Enable TensorBoard Logging#
Install TensorBoard (if not already available):
pip install tensorboard
Configure logging in your training setup. In these examples, cfg refers to a ConfigContainer instance (such as one produced by a recipe), which contains a logger attribute representing the LoggerConfig:
from megatron.bridge.training.config import LoggerConfig
cfg.logger = LoggerConfig(
    tensorboard_dir="./runs/tensorboard",
    tensorboard_log_interval=10,
    log_timers_to_tensorboard=True,   # optional
    log_memory_to_tensorboard=False,  # optional
)
Note
The writer is created lazily on the last rank when tensorboard_dir is set.
Set the Output Directory#
TensorBoard event files are saved to the directory specified by tensorboard_dir.
Example with additional metrics enabled:
cfg.logger.tensorboard_dir = "./logs/tb"
cfg.logger.tensorboard_log_interval = 5
cfg.logger.log_loss_scale_to_tensorboard = True
cfg.logger.log_validation_ppl_to_tensorboard = True
cfg.logger.log_world_size_to_tensorboard = True
cfg.logger.log_timers_to_tensorboard = True
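To view the logged metrics, launch TensorBoard against the configured directory:
tensorboard --logdir ./logs/tb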
Weights & Biases (W&B)#
What Gets Logged#
When enabled, W&B automatically mirrors the scalar metrics logged to TensorBoard.
In addition, the full run configuration is synced at initialization, allowing for reproducibility and experiment tracking.
Enable W&B Logging#
Install W&B (if not already available):
pip install wandb
Authenticate with W&B using one of the following methods:
Set WANDB_API_KEY in the environment before the run, or
Run wandb login once on the machine.
Configure logging in your training setup. In these examples, cfg refers to a ConfigContainer instance (such as one produced by a recipe), which contains a logger attribute representing the LoggerConfig:
from megatron.bridge.training.config import LoggerConfig
cfg.logger = LoggerConfig(
    tensorboard_dir="./runs/tensorboard",  # recommended: enables shared logging gate
    wandb_project="my_project",
    wandb_exp_name="my_experiment",
    wandb_entity="my_team",         # optional
    wandb_save_dir="./runs/wandb",  # optional
)
Note
W&B is initialized lazily on the last rank when wandb_project is set and wandb_exp_name is non-empty.
W&B Configuration with NeMo Run Launching#
For users launching training scripts with NeMo Run, W&B can optionally be configured using the megatron.bridge.recipes.run_plugins.WandbPlugin.
The plugin automatically forwards the WANDB_API_KEY and by default injects CLI overrides for the following logger parameters:
logger.wandb_project
logger.wandb_entity
logger.wandb_exp_name
logger.wandb_save_dir
This allows seamless integration of W&B logging into your training workflow without manual configuration.
Progress Log#
When logger.log_progress is enabled, the framework generates a progress.txt file in the checkpoint save directory.
This file includes:
Job-level metadata, such as timestamp and GPU count
Periodic progress entries throughout training
At each checkpoint boundary, the log is updated with:
Job throughput (TFLOP/s/GPU)
Cumulative throughput
Total floating-point operations
Tokens processed
This provides a lightweight, text-based audit trail of training progress, useful for tracking performance across restarts.
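Enabling it is a single flag, per the logger.log_progress setting referenced above:
cfg.logger.log_progress = True  # write progress.txt to the checkpoint save directory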
Tensor Inspection#
Megatron Bridge integrates with TransformerEngine’s tensor inspection features via NVIDIA DLFW Inspect. This integration, controlled by TensorInspectConfig, enables advanced debugging and analysis of tensor statistics during training. When enabled, the framework handles initialization, step tracking, and cleanup automatically.
Note
Current limitations: Tensor inspection is currently supported only for linear modules in TransformerEngine (e.g., fc1, fc2, layernorm_linear). Operations like attention are not supported.
Note
This section covers Megatron Bridge configuration. For comprehensive documentation on features, configuration syntax, and advanced usage, see the NVIDIA DLFW Inspect and TransformerEngine debug features documentation.
Installation#
Install NVIDIA DLFW Inspect if not already available:
pip install nvdlfw-inspect
Available Features#
TransformerEngine provides the following debug features:
LogTensorStats – Logs high-precision tensor statistics: min, max, mean, std, l1_norm, l2_norm, cur_amax, dynamic_range.
LogFp8TensorStats – Logs quantized tensor statistics for FP8 recipes: underflows%, scale_inv_min, scale_inv_max, mse. Supports simulating alternative recipes (e.g., tracking mxfp8_underflows% during per-tensor current-scaling training).
DisableFP8GEMM – Runs specific GEMM operations in high precision.
DisableFP8Layer – Disables FP8 for entire layers.
PerTensorScaling – Enables per-tensor current scaling for specific tensors.
FakeQuant – Experimental quantization testing.
See TransformerEngine debug features for complete parameter lists and usage details.
Configuration#
Configure tensor inspection using TensorInspectConfig with either a YAML file or inline dictionary.
YAML Configuration#
tensor_inspect:
  enabled: true
  features: ./conf/fp8_tensor_stats.yaml
  log_dir: ./logs/tensor_inspect
Example feature configuration file:
fp8_tensor_stats:
  enabled: true
  layers:
    layer_name_regex_pattern: ".*(fc2)"
  transformer_engine:
    LogFp8TensorStats:
      enabled: true
      tensors: [weight, activation, gradient]
      stats: ["underflows%", "mse"]
      freq: 5
      start_step: 0
      end_step: 100
Python Configuration#
from megatron.bridge.training.config import TensorInspectConfig
# Option 1: inline Python dict
cfg.tensor_inspect = TensorInspectConfig(
    enabled=True,
    features={
        "fp8_gradient_stats": {
            "enabled": True,
            "layers": {"layer_name_regex_pattern": ".*(fc1|fc2)"},
            "transformer_engine": {
                "LogFp8TensorStats": {
                    "enabled": True,
                    "tensors": ["weight", "activation", "gradient"],
                    "stats": ["underflows%", "mse"],
                    "freq": 5,
                    "start_step": 0,
                    "end_step": 100,
                },
            },
        }
    },
    log_dir="./logs/tensor_inspect",
)
# Option 2: reference external YAML
cfg.tensor_inspect = TensorInspectConfig(
    enabled=True,
    features="./conf/fp8_inspect.yaml",
    log_dir="./logs/tensor_inspect",
)
Layer Selection#
Features apply to linear modules matched by selectors in the layers section:
layer_name_regex_pattern: ".*" – All supported linear layers
layer_name_regex_pattern: ".*layers\.(0|1|2).*(fc1|fc2|layernorm_linear)" – Linear modules in the first three transformer layers
layer_name_regex_pattern: ".*(fc1|fc2)" – MLP projections only
layer_types: [layernorm_linear, fc1] – String matching (alternative to regex)
Tensor-level selectors (tensors, tensors_struct) control which tensor roles are logged: activation, gradient, weight, output, wgrad, dgrad.
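For finer control, per-tensor statistics can be given with tensors_struct. The following is a sketch modeled on the TransformerEngine debug schema (the feature name mlp_stats and the specific stats chosen are illustrative); consult TransformerEngine debug features for the authoritative form:
# Sketch: per-tensor stat selection via tensors_struct
mlp_stats:
  enabled: true
  layers:
    layer_name_regex_pattern: ".*(fc1|fc2)"
  transformer_engine:
    LogTensorStats:
      enabled: true
      tensors_struct:
        - tensor: activation
          stats: [mean, std]
        - tensor: gradient
          stats: [l2_norm]
      freq: 10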
Output and Monitoring#
Tensor statistics are written to tensor_inspect.log_dir and forwarded to TensorBoard/W&B when enabled.
Log locations:
Text logs: <log_dir>/nvdlfw_inspect_statistics_logs/
TensorBoard (when enabled)
W&B (when enabled)
Performance Considerations#
Use freq > 1 to reduce overhead; statistics collection is expensive for large models.
Narrow layer selection with specific regex patterns rather than .*.
Console Logging#
Megatron Bridge uses the standard Python logging subsystem for console output.
Configure Console Logging#
To control console logging behavior, use the following configuration options:
logging_level sets the default verbosity level. It can be overridden via the MEGATRON_BRIDGE_LOGGING_LEVEL environment variable.
filter_warnings suppresses messages at the WARNING level.
modules_to_filter specifies logger name prefixes to exclude from output.
set_level_for_all_loggers determines whether the configured level is applied to all loggers or only to the framework's own.
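A minimal sketch, assuming logging_level accepts standard logging module levels; the module prefix below is purely illustrative:
import logging

cfg.logger.logging_level = logging.INFO   # default verbosity (assumed to accept stdlib levels)
cfg.logger.filter_warnings = True         # suppress WARNING-level messages
cfg.logger.modules_to_filter = ["transformer_engine"]  # hypothetical prefix to exclude
cfg.logger.set_level_for_all_loggers = False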
Monitor Logging Cadence and Content#
To monitor training progress at regular intervals, the framework prints a summary line every log_interval iterations.
Each summary includes:
Timestamp
Iteration counters
Consumed and skipped samples
Iteration time (ms)
Learning rates
Global batch size
Per-loss averages
Loss scale
When enabled, additional metrics are printed:
Gradient norm
Zeros in gradients
Parameter norm
Energy and power per GPU
Straggler timing reports follow the same log_interval cadence, helping identify performance bottlenecks across ranks.
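For example, to print summaries every 10 iterations (where log_interval lives within ConfigContainer is an assumption here; cfg.train is hypothetical):
cfg.train.log_interval = 10  # emit a console summary line every 10 iterations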
Minimize Timing Overhead#
To reduce performance impact, set timing_log_level to 0.
Increase to 1 or 2 only when more detailed timing metrics are required, as higher levels introduce additional logging overhead.