Logger#
The logger is designed to track key training metrics (including distributed metrics with reductions and timing) and to provide integration with logging backends such as WandB, TensorBoard, and MLflow.
Requirements#
Tracking distributed metrics with specified reductions (mean, max, etc.)
Tracking distributed timing with (usually) ‘max’ reduction across ranks
Logging:
WandB
Tensorboard
MLflow
Overall Design#
Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.
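As a rough illustration of that gathering step (a minimal sketch; the reduce_metrics helper and its signature are hypothetical and not part of the NeMo RL API), per-rank values can be reduced on the driver according to the reduction specified for each metric:

from typing import Any

import numpy as np


def reduce_metrics(
    per_rank_metrics: list[dict[str, float]],
    reductions: dict[str, str],
) -> dict[str, float]:
    """Reduce per-rank metric dicts into a single dict using the named reduction per key.

    `per_rank_metrics` holds one dict per rank (already gathered on the driver);
    `reductions` maps each metric name to "mean", "max", "min", or "sum".
    """
    reduced: dict[str, float] = {}
    for key, how in reductions.items():
        values = [m[key] for m in per_rank_metrics if key in m]
        op = {"mean": np.mean, "max": np.max, "min": np.min, "sum": np.sum}[how]
        reduced[key] = float(op(values))
    return reduced


# Example: average the loss across ranks, but report the slowest rank's step time.
metrics = reduce_metrics(
    per_rank_metrics=[{"loss": 0.12, "step_time": 1.9}, {"loss": 0.10, "step_time": 2.1}],
    reductions={"loss": "mean", "step_time": "max"},
)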
To handle multiple logger backends, we will have a LoggerInterface interface that the TensorboardLogger, WandbLogger, and MLflowLogger classes will implement:
from abc import ABC, abstractmethod
from typing import Any, Optional


class LoggerInterface(ABC):
    """Abstract base class for logger backends."""

    @abstractmethod
    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        """Log a dictionary of metrics."""
        pass

    @abstractmethod
    def log_hyperparams(self, params: dict[str, Any]) -> None:
        """Log a dictionary of hyperparameters."""
        pass
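As a sketch of what a concrete backend might look like (illustrative only; the actual TensorboardLogger implementation may differ), a TensorBoard backend could wrap torch.utils.tensorboard.SummaryWriter:

from typing import Any, Optional

from torch.utils.tensorboard import SummaryWriter


class TensorboardLogger(LoggerInterface):
    """Illustrative TensorBoard backend built on SummaryWriter."""

    def __init__(self, log_dir: str = "logs"):
        self._writer = SummaryWriter(log_dir=log_dir)

    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        for name, value in metrics.items():
            tag = f"{prefix}/{name}" if prefix else name
            self._writer.add_scalar(tag, value, global_step=step)
        self._writer.flush()

    def log_hyperparams(self, params: dict[str, Any]) -> None:
        # add_hparams expects a flat dict of scalar hyperparameters.
        self._writer.add_hparams(hparam_dict=params, metric_dict={})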
A Logger wrapper class will also implement LoggerInterface and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example:
# Initialize logger with the WandB and MLflow backends enabled (TensorBoard disabled)
logger_config = {
    "wandb_enabled": True,
    "tensorboard_enabled": False,
    "mlflow_enabled": True,
    "wandb": {
        "project": "grpo-dev",
        "name": "grpo-dev-logging",
    },
    "tensorboard": {
        "log_dir": "logs",
    },
    "mlflow": {
        "experiment_name": "nemo-rl-experiment",
        "run_name": "grpo-dev-run",
        "tracking_uri": None,  # Use local tracking
    },
}

logger = Logger(
    cfg=logger_config,
)

# Log metrics; they will go to all enabled backends
logger.log_metrics({
    "loss": 0.123,
}, step=10)
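A minimal sketch of the delegation the Logger wrapper performs (illustrative; the real constructor and config handling may differ, and passing each backend's config sub-dict as keyword arguments is an assumption):

from typing import Any, Optional


class Logger(LoggerInterface):
    """Fans out every call to all enabled backend loggers."""

    def __init__(self, cfg: dict[str, Any]):
        self._loggers: list[LoggerInterface] = []
        if cfg.get("wandb_enabled"):
            self._loggers.append(WandbLogger(**cfg["wandb"]))
        if cfg.get("tensorboard_enabled"):
            self._loggers.append(TensorboardLogger(**cfg["tensorboard"]))
        if cfg.get("mlflow_enabled"):
            self._loggers.append(MLflowLogger(**cfg["mlflow"]))

    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        for logger in self._loggers:
            logger.log_metrics(metrics, step, prefix)

    def log_hyperparams(self, params: dict[str, Any]) -> None:
        for logger in self._loggers:
            logger.log_hyperparams(params)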
Supported Logging Backends#
The logger supports three main logging backends:
WandB (Weights & Biases)#
Provides cloud-based experiment tracking
Supports custom step metrics for better visualization (see the sketch after this list)
Includes built-in hyperparameter logging
Offers rich visualization and collaboration features
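For example, custom step metrics in the standard wandb Python API (the metric names below are illustrative) allow validation metrics to be plotted against their own step counter:

import wandb

run = wandb.init(project="grpo-dev", name="grpo-dev-logging")

# Plot everything under "val/" against a dedicated validation step counter.
wandb.define_metric("val_step")
wandb.define_metric("val/*", step_metric="val_step")

wandb.log({"train/loss": 0.123}, step=10)
wandb.log({"val/accuracy": 0.87, "val_step": 2})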
Tensorboard#
Local file-based logging
Standard TensorBoard visualization
Supports hyperparameter logging via HParams
Lightweight and self-contained
MLflow#
Comprehensive platform for experiment tracking and model management
Supports both local and remote tracking servers
Provides model versioning and artifact management
Includes a web UI for experiment visualization
Supports model deployment and serving
MLflow Configuration#
MLflow can be configured with the following parameters:
mlflow:
  experiment_name: "nemo-rl-experiment"  # Name of the MLflow experiment
  run_name: "my-training-run"            # Run name
  tracking_uri: "http://localhost:5000"  # Optional tracking server URI
MLflow UI#
After starting training with MLflow enabled, you can view the MLflow UI to monitor your experiments:
# Start MLflow UI (run in a separate terminal)
mlflow ui --host 0.0.0.0 --port 5000
Then access the UI at http://127.0.0.1:5000/ to view:
Training runs and experiments
Metrics (loss, validation metrics, etc.)
Hyperparameters
Model artifacts and checkpoints
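The same information can also be queried programmatically with the MLflow client API (a sketch, assuming a recent MLflow version in which mlflow.search_runs accepts experiment_names):

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Returns a pandas DataFrame with one row per run, including logged metrics and params.
runs = mlflow.search_runs(experiment_names=["nemo-rl-experiment"])
print(runs[["run_id", "status"]].head())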
Validation Pretty Logging#
The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the num_val_samples_to_print configuration parameter.
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  num_val_samples_to_print: 10
When num_val_samples_to_print is set to a value greater than 0, the logger will generate well-formatted text outputs for the specified number of validation samples. This is particularly useful for:
Quickly inspecting model generation quality during training.
Comparing inputs and outputs side-by-side.
Tracking validation sample performance over time.
Example Output#
When enabled, the pretty logging generates a formatted text block for each selected validation sample, showing the sample's input and the model's generated output so they can be compared side by side.
GPU Metric Logging#
NeMo RL monitors GPU memory and utilization through system metrics exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, and/or MLflow.
This approach allows us to offer the same GPU metric tracking on all loggers and simplifies the implementation greatly.
This feature is enabled with the monitor_gpus configuration parameter. The frequency of data collection and of flushing to the loggers is controlled by the collection_interval and flush_interval parameters under gpu_monitoring, both specified in seconds.
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  monitor_gpus: true
  gpu_monitoring:
    collection_interval: 10
    flush_interval: 10
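A rough sketch of the collect-and-flush loop these intervals control (illustrative only; NeMo RL reads the GPU data exposed by Ray nodes, whereas this sketch calls NVML directly via pynvml for simplicity):

import threading
import time

import pynvml


def monitor_gpus(logger, collection_interval: int, flush_interval: int) -> None:
    """Sample GPU memory/utilization every collection_interval seconds and
    flush averaged values to the logger every flush_interval seconds."""
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]
    samples: list[dict[str, float]] = []
    step = 0
    last_flush = time.monotonic()

    def loop():
        nonlocal step, last_flush
        while True:
            sample = {}
            for i, handle in enumerate(handles):
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                sample[f"gpu{i}/memory_used_gb"] = mem.used / 1024**3
                sample[f"gpu{i}/utilization_pct"] = float(util.gpu)
            samples.append(sample)
            if time.monotonic() - last_flush >= flush_interval:
                # Average the buffered samples and send them to every enabled backend.
                avg = {k: sum(s[k] for s in samples) / len(samples) for k in samples[0]}
                logger.log_metrics(avg, step=step, prefix="gpu")
                samples.clear()
                step += 1
                last_flush = time.monotonic()
            time.sleep(collection_interval)

    threading.Thread(target=loop, daemon=True).start()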
Note
While it is feasible to monitor GPUs from remote workers, such an implementation requires careful attention to detail to ensure:
Logs sent back to the driver do not introduce significant overhead.
Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
Workers can gracefully flush their logs in case of failure.
Logging behaves consistently across TensorBoard, WandB, and MLflow.
Workers that spawn other workers accurately report the total resource usage of any grandchild workers.
Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver.