Logger#
The logger is designed to track key training metrics (including distributed metrics with reductions and timing) and to provide integration with logging backends such as WandB, TensorBoard, and MLflow.
Requirements#
Tracking distributed metrics with specified reductions (mean, max, etc.)
Tracking distributed timing with (usually) ‘max’ reduction across ranks
Logging:
WandB
Tensorboard
MLflow
Overall Design#
Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.
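As a rough illustration of that gathering step (a minimal sketch; the reduce_metrics helper and its signature are hypothetical and not part of the NeMo RL API), per-rank values can be reduced on the driver according to the reduction specified for each metric:

from typing import Any

import numpy as np


def reduce_metrics(
    per_rank_metrics: list[dict[str, float]],
    reductions: dict[str, str],
) -> dict[str, float]:
    """Reduce per-rank metric dicts into a single dict using the named reduction per key.

    `per_rank_metrics` holds one dict per rank (already gathered on the driver);
    `reductions` maps each metric name to "mean", "max", "min", or "sum".
    """
    reduced: dict[str, float] = {}
    for key, how in reductions.items():
        values = [m[key] for m in per_rank_metrics if key in m]
        op = {"mean": np.mean, "max": np.max, "min": np.min, "sum": np.sum}[how]
        reduced[key] = float(op(values))
    return reduced


# Example: average the loss across ranks, but report the slowest rank's step time.
metrics = reduce_metrics(
    per_rank_metrics=[{"loss": 0.12, "step_time": 1.9}, {"loss": 0.10, "step_time": 2.1}],
    reductions={"loss": "mean", "step_time": "max"},
)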
To handle multiple logger backends, we will have a LoggerInterface interface that the TensorboardLogger, WandbLogger, and MLflowLogger classes will implement:
from abc import ABC, abstractmethod
from typing import Any, Optional


class LoggerInterface(ABC):
    """Abstract base class for logger backends."""

    @abstractmethod
    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        """Log a dictionary of metrics."""
        pass

    @abstractmethod
    def log_hyperparams(self, params: dict[str, Any]) -> None:
        """Log a dictionary of hyperparameters."""
        pass
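As a sketch of what a concrete backend might look like (illustrative only; the actual TensorboardLogger implementation may differ), a TensorBoard backend could wrap torch.utils.tensorboard.SummaryWriter:

from typing import Any, Optional

from torch.utils.tensorboard import SummaryWriter


class TensorboardLogger(LoggerInterface):
    """Illustrative TensorBoard backend built on SummaryWriter."""

    def __init__(self, log_dir: str = "logs"):
        self._writer = SummaryWriter(log_dir=log_dir)

    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        for name, value in metrics.items():
            tag = f"{prefix}/{name}" if prefix else name
            self._writer.add_scalar(tag, value, global_step=step)
        self._writer.flush()

    def log_hyperparams(self, params: dict[str, Any]) -> None:
        # add_hparams expects a flat dict of scalar hyperparameters.
        self._writer.add_hparams(hparam_dict=params, metric_dict={})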
A Logger wrapper class will also implement LoggerInterface and maintain a list of loggers to which it delegates writing logs. This will be the main class the user uses in the training loop. Usage example:
# Initialize logger with the WandB and MLflow backends enabled (TensorBoard disabled)
logger_config = {
    "wandb_enabled": True,
    "tensorboard_enabled": False,
    "mlflow_enabled": True,
    "wandb": {
        "project": "grpo-dev",
        "name": "grpo-dev-logging",
    },
    "tensorboard": {
        "log_dir": "logs",
    },
    "mlflow": {
        "experiment_name": "nemo-rl-experiment",
        "run_name": "grpo-dev-run",
        "tracking_uri": None,  # Use local tracking
    },
}

logger = Logger(
    cfg=logger_config,
)

# Log metrics; they will go to all enabled backends
logger.log_metrics({
    "loss": 0.123,
}, step=10)
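A minimal sketch of the delegation the Logger wrapper performs (illustrative; the real constructor and config handling may differ, and passing each backend's config sub-dict as keyword arguments is an assumption):

from typing import Any, Optional


class Logger(LoggerInterface):
    """Fans out every call to all enabled backend loggers."""

    def __init__(self, cfg: dict[str, Any]):
        self._loggers: list[LoggerInterface] = []
        if cfg.get("wandb_enabled"):
            self._loggers.append(WandbLogger(**cfg["wandb"]))
        if cfg.get("tensorboard_enabled"):
            self._loggers.append(TensorboardLogger(**cfg["tensorboard"]))
        if cfg.get("mlflow_enabled"):
            self._loggers.append(MLflowLogger(**cfg["mlflow"]))

    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        for logger in self._loggers:
            logger.log_metrics(metrics, step, prefix)

    def log_hyperparams(self, params: dict[str, Any]) -> None:
        for logger in self._loggers:
            logger.log_hyperparams(params)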
Supported Logging Backends#
The logger supports three main logging backends:
WandB (Weights & Biases)#
Provides cloud-based experiment tracking
Supports custom step metrics for better visualization (see the sketch after this list)
Includes built-in hyperparameter logging
Offers rich visualization and collaboration features
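For example, custom step metrics in the standard wandb Python API (the metric names below are illustrative) allow validation metrics to be plotted against their own step counter:

import wandb

run = wandb.init(project="grpo-dev", name="grpo-dev-logging")

# Plot everything under "val/" against a dedicated validation step counter.
wandb.define_metric("val_step")
wandb.define_metric("val/*", step_metric="val_step")

wandb.log({"train/loss": 0.123}, step=10)
wandb.log({"val/accuracy": 0.87, "val_step": 2})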
Tensorboard#
Local file-based logging
Standard TensorBoard visualization
Supports hyperparameter logging via HParams
Lightweight and self-contained
MLflow#
Comprehensive platform for experiment tracking and model management
Supports both local and remote tracking servers
Provides model versioning and artifact management
Includes a web UI for experiment visualization
Supports model deployment and serving
MLflow Configuration#
MLflow can be configured with the following parameters:
mlflow:
  experiment_name: "nemo-rl-experiment"  # Name of the MLflow experiment
  run_name: "my-training-run"            # Run name
  tracking_uri: "http://localhost:5000"  # Optional tracking server URI
MLflow UI#
After starting training with MLflow enabled, you can view the MLflow UI to monitor your experiments:
# Start MLflow UI (run in a separate terminal)
mlflow ui --host 0.0.0.0 --port 5000
Then access the UI at http://127.0.0.1:5000/ to view:
Training runs and experiments
Metrics (loss, validation metrics, etc.)
Hyperparameters
Model artifacts and checkpoints
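The same information can also be queried programmatically with the MLflow client API (a sketch, assuming a recent MLflow version in which mlflow.search_runs accepts experiment_names):

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Returns a pandas DataFrame with one row per run, including logged metrics and params.
runs = mlflow.search_runs(experiment_names=["nemo-rl-experiment"])
print(runs[["run_id", "status"]].head())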
Validation Pretty Logging#
The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the num_val_samples_to_print configuration parameter.
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  num_val_samples_to_print: 10
When num_val_samples_to_print is set to a value greater than 0, the logger will generate well-formatted text outputs for the specified number of validation samples. This is particularly useful for:
Quickly inspecting model generation quality during training.
Comparing inputs and outputs side-by-side.
Tracking validation sample performance over time.
Example Output#
When enabled, the pretty logging generates a formatted text block for each selected validation sample, showing the sample's input and the model's generated output so they can be compared side by side.
GPU Metric Logging#
NeMo RL monitors GPU memory and utilization through system metrics exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, and/or MLflow.
This approach allows us to offer the same GPU metric tracking on all loggers and simplifies the implementation greatly.
This feature is enabled with the monitor_gpus configuration parameter. The frequency of data collection and of flushing to the loggers is controlled by the collection_interval and flush_interval parameters under gpu_monitoring, both specified in seconds.
logger:
  wandb_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  monitor_gpus: true
  gpu_monitoring:
    collection_interval: 10
    flush_interval: 10
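A rough sketch of the collect-and-flush loop these intervals control (illustrative only; NeMo RL reads the GPU data exposed by Ray nodes, whereas this sketch calls NVML directly via pynvml for simplicity):

import threading
import time

import pynvml


def monitor_gpus(logger, collection_interval: int, flush_interval: int) -> None:
    """Sample GPU memory/utilization every collection_interval seconds and
    flush averaged values to the logger every flush_interval seconds."""
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]
    samples: list[dict[str, float]] = []
    step = 0
    last_flush = time.monotonic()

    def loop():
        nonlocal step, last_flush
        while True:
            sample = {}
            for i, handle in enumerate(handles):
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                sample[f"gpu{i}/memory_used_gb"] = mem.used / 1024**3
                sample[f"gpu{i}/utilization_pct"] = float(util.gpu)
            samples.append(sample)
            if time.monotonic() - last_flush >= flush_interval:
                # Average the buffered samples and send them to every enabled backend.
                avg = {k: sum(s[k] for s in samples) / len(samples) for k in samples[0]}
                logger.log_metrics(avg, step=step, prefix="gpu")
                samples.clear()
                step += 1
                last_flush = time.monotonic()
            time.sleep(collection_interval)

    threading.Thread(target=loop, daemon=True).start()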
Note
While it is feasible to monitor GPUs from remote workers, such an implementation requires careful attention to detail to ensure:
Logs sent back to the driver do not introduce significant overhead.
Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
Workers can gracefully flush their logs in case of failure.
Logging behaves consistently across TensorBoard, WandB, and MLflow.
Workers that spawn other workers accurately report the total resource usage of any grandchild workers.
Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver.