`nemo_rl.utils.logger`#

Module Contents#

Classes#

`WandbConfig`
`TensorboardConfig`
`MLflowConfig`
`GPUMonitoringConfig`
`LoggerConfig`
`LoggerInterface`	Abstract base class for logger backends.
`TensorboardLogger`	Tensorboard logger backend.
`WandbLogger`	Weights & Biases logger backend.
`GpuMetricSnapshot`
`RayGpuMonitorLogger`	Monitor GPU utilization across a Ray cluster and log metrics to a parent logger.
`MLflowLogger`	MLflow logger backend.
`Logger`	Main logger class that delegates to multiple backend loggers.

Functions#

`flatten_dict`	Flatten a nested dictionary.
`configure_rich_logging`	Configure rich logging for more visually appealing log output.
`print_message_log_samples`	Print sampled message logs and their rewards in a horizontal layout with emoji indicators.
`get_next_experiment_dir`	Create a new experiment directory with an incremented ID.

Data#

_rich_logging_configured

API#

nemo_rl.utils.logger._rich_logging_configured#: False

class nemo_rl.utils.logger.WandbConfig[source]#

Bases: typing.TypedDict

project: NotRequired[str]#: None

name: NotRequired[str]#: None

class nemo_rl.utils.logger.TensorboardConfig[source]#

Bases: typing.TypedDict

log_dir: NotRequired[str]#: None

class nemo_rl.utils.logger.MLflowConfig[source]#

Bases: typing.TypedDict

experiment_name: str#: None

run_name: str#: None

tracking_uri: NotRequired[str]#: None

class nemo_rl.utils.logger.GPUMonitoringConfig[source]#

Bases: typing.TypedDict

collection_interval: int | float#: None

flush_interval: int | float#: None

class nemo_rl.utils.logger.LoggerConfig[source]#

Bases: typing.TypedDict

log_dir: str#: None

wandb_enabled: bool#: None

tensorboard_enabled: bool#: None

mlflow_enabled: bool#: None

wandb: nemo_rl.utils.logger.WandbConfig#: None

tensorboard: nemo_rl.utils.logger.TensorboardConfig#: None

mlflow: NotRequired[nemo_rl.utils.logger.MLflowConfig]#: None

monitor_gpus: bool#: None

gpu_monitoring: nemo_rl.utils.logger.GPUMonitoringConfig#: None

num_val_samples_to_print: int#: None
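As a rough illustration of how the fields above fit together, the following is a plain dictionary shaped like `LoggerConfig`. Every value is a placeholder chosen for the example, not a library default, and the optional `mlflow` entry is left out.

```python
# Plain dict mirroring the LoggerConfig fields listed above.
# All values are illustrative placeholders, not library defaults.
logger_config = {
    "log_dir": "logs",
    "wandb_enabled": False,
    "tensorboard_enabled": True,
    "mlflow_enabled": False,
    "wandb": {"project": "my-project", "name": "my-run"},  # WandbConfig
    "tensorboard": {"log_dir": "tensorboard"},              # TensorboardConfig
    "monitor_gpus": True,
    "gpu_monitoring": {                                     # GPUMonitoringConfig
        "collection_interval": 10,  # seconds between GPU metric collections
        "flush_interval": 60,       # seconds between flushes to the parent logger
    },
    "num_val_samples_to_print": 5,
}
```

A dictionary of this shape is what the `Logger` constructor documented below takes as `cfg`.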

class nemo_rl.utils.logger.LoggerInterface[source]#

Bases: abc.ABC

Abstract base class for logger backends.

abstractmethod log_metrics( metrics: dict[str, Any], step: int, prefix: Optional[str] = '', step_metric: Optional[str] = None, ) → None[source]#: Log a dictionary of metrics.

abstractmethod log_hyperparams(params: Mapping[str, Any]) → None[source]#: Log dictionary of hyperparameters.
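A minimal sketch of what a custom backend could look like. To stay runnable without `nemo_rl` installed, it defines a stand-in ABC (`LoggerInterfaceSketch`) with the same two abstract methods; `InMemoryLogger` and the `"<prefix>/<name>"` naming scheme are assumptions made for the example, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any, Mapping, Optional


class LoggerInterfaceSketch(ABC):
    """Stand-in with the same two abstract methods as LoggerInterface."""

    @abstractmethod
    def log_metrics(
        self,
        metrics: dict[str, Any],
        step: int,
        prefix: Optional[str] = "",
        step_metric: Optional[str] = None,
    ) -> None: ...

    @abstractmethod
    def log_hyperparams(self, params: Mapping[str, Any]) -> None: ...


class InMemoryLogger(LoggerInterfaceSketch):
    """Hypothetical backend that keeps everything in Python containers."""

    def __init__(self) -> None:
        self.records: list[tuple[int, dict[str, Any]]] = []
        self.hyperparams: dict[str, Any] = {}

    def log_metrics(self, metrics, step, prefix="", step_metric=None) -> None:
        # Assumed naming scheme: "<prefix>/<metric>" when a prefix is given.
        named = {f"{prefix}/{k}" if prefix else k: v for k, v in metrics.items()}
        self.records.append((step, named))

    def log_hyperparams(self, params) -> None:
        self.hyperparams.update(params)
```

In real code the subclass would derive from `nemo_rl.utils.logger.LoggerInterface` instead of the stand-in.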

class nemo_rl.utils.logger.TensorboardLogger( cfg: nemo_rl.utils.logger.TensorboardConfig, log_dir: Optional[str] = None, )[source]#

Bases: nemo_rl.utils.logger.LoggerInterface

Tensorboard logger backend.

Initialization

log_metrics( metrics: dict[str, Any], step: int, prefix: Optional[str] = '', step_metric: Optional[str] = None, ) → None[source]#

Log metrics to Tensorboard.

Parameters:

metrics – Dict of metrics to log
step – Global step value
prefix – Optional prefix for metric names
step_metric – Optional step metric name (ignored in TensorBoard)

log_hyperparams(params: Mapping[str, Any]) → None[source]#

Log hyperparameters to Tensorboard.

Parameters:

params – Dictionary of hyperparameters to log

log_plot( figure: matplotlib.pyplot.Figure, step: int, name: str, ) → None[source]#

Log a plot to Tensorboard.

Parameters:

figure – Matplotlib figure to log
step – Global step value
name – Name of the plot

class nemo_rl.utils.logger.WandbLogger( cfg: nemo_rl.utils.logger.WandbConfig, log_dir: Optional[str] = None, )[source]#

Bases: nemo_rl.utils.logger.LoggerInterface

Weights & Biases logger backend.

Initialization

_log_diffs()[source]#

Log git diffs to wandb.

This function captures and logs two types of diffs:

Uncommitted changes (working tree diff against HEAD)
All changes (including uncommitted) against the main branch

Each diff is saved as a text file in a wandb artifact.

_log_code()[source]#

Log code that is tracked by git to wandb.

This function gets a list of all files tracked by git in the project root and manually uploads them to the current wandb run as an artifact.

define_metric( name: str, step_metric: Optional[str] = None, ) → None[source]#

Define a metric with custom step metric.

Parameters:

name – Name of the metric or pattern (e.g. ‘ray/*’)
step_metric – Optional name of the step metric to use

log_metrics( metrics: dict[str, Any], step: int, prefix: Optional[str] = '', step_metric: Optional[str] = None, ) → None[source]#

Log metrics to wandb.

Parameters:

metrics – Dict of metrics to log
step – Global step value
prefix – Optional prefix for metric names
step_metric – Optional name of a field in metrics to use as step instead of the provided step value
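The `step_metric` semantics described above can be illustrated with a small helper. `resolve_step` is a hypothetical function written for this example; the actual `WandbLogger` may instead rely on wandb's own step handling (for instance through `define_metric`), so treat this only as a reading of the parameter description.

```python
from typing import Any, Optional


def resolve_step(
    metrics: dict[str, Any],
    step: int,
    step_metric: Optional[str] = None,
) -> Any:
    """Return the value that acts as the x-axis for this call (hypothetical helper)."""
    # If step_metric names a field that is present in metrics, that field wins.
    if step_metric is not None and step_metric in metrics:
        return metrics[step_metric]
    # Otherwise fall back to the explicitly provided step.
    return step
```

For example, `resolve_step({"ray/ray_step": 7, "ray/gpu_util": 0.9}, step=3, step_metric="ray/ray_step")` yields `7`, while omitting `step_metric` yields `3`.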

log_hyperparams(params: Mapping[str, Any]) → None[source]#

Log hyperparameters to wandb.

Parameters:

params – Dict of hyperparameters to log

log_plot( figure: matplotlib.pyplot.Figure, step: int, name: str, ) → None[source]#

Log a plot to wandb.

Parameters:

figure – Matplotlib figure to log
step – Global step value
name – Name of the plot

class nemo_rl.utils.logger.GpuMetricSnapshot[source]#

Bases: typing.TypedDict

step: int#: None

metrics: dict[str, Any]#: None

class nemo_rl.utils.logger.RayGpuMonitorLogger( collection_interval: int | float, flush_interval: int | float, metric_prefix: str, step_metric: str, parent_logger: Optional[nemo_rl.utils.logger.Logger] = None, )[source]#

Monitor GPU utilization across a Ray cluster and log metrics to a parent logger.

Initialization

Initialize the GPU monitor.

Parameters:

collection_interval – Interval in seconds to collect GPU metrics
flush_interval – Interval in seconds to flush metrics to parent logger
metric_prefix – Prefix for the logged metric names
step_metric – Name of the field to use as the step metric
parent_logger – Logger to receive the collected metrics

start() → None[source]#: Start the GPU monitoring thread.

stop() → None[source]#: Stop the GPU monitoring thread.

_collection_loop() → None[source]#: Main collection loop that runs in a separate thread.
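The collect-then-flush pattern behind `start`, `stop`, `_collection_loop`, and `flush` can be sketched as follows. `PeriodicCollector`, `collect_fn`, and `sink` are hypothetical names for this example; the real class gathers metrics from Ray nodes, which is replaced here by an injected callable. Buffered entries have the same `step`/`metrics` shape as `GpuMetricSnapshot`.

```python
import threading
import time
from typing import Any, Callable


class PeriodicCollector:
    """Hypothetical stand-in: collect on one interval, flush on a longer one."""

    def __init__(
        self,
        collect_fn: Callable[[], dict[str, Any]],
        sink: Callable[[dict[str, Any], int], None],
        collection_interval: float,
        flush_interval: float,
    ) -> None:
        self.collect_fn = collect_fn
        self.sink = sink
        self.collection_interval = collection_interval
        self.flush_interval = flush_interval
        self.buffer: list[dict[str, Any]] = []
        self.step = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = None

    def collect_once(self) -> None:
        metrics = self.collect_fn()
        with self._lock:
            # Same shape as GpuMetricSnapshot: a step plus a metrics dict.
            self.buffer.append({"step": self.step, "metrics": metrics})
            self.step += 1

    def flush(self) -> None:
        # Swap the buffer under the lock, then deliver outside of it.
        with self._lock:
            pending, self.buffer = self.buffer, []
        for snapshot in pending:
            self.sink(snapshot["metrics"], snapshot["step"])

    def _loop(self) -> None:
        last_flush = time.monotonic()
        while not self._stop.is_set():
            self.collect_once()
            if time.monotonic() - last_flush >= self.flush_interval:
                self.flush()
                last_flush = time.monotonic()
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self.collection_interval)

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        self.flush()  # deliver whatever is still buffered
```

Buffering between flushes keeps the parent logger from being called on every collection tick, which is presumably why the two intervals are configured separately.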

_parse_metric( sample: prometheus_client.samples.Sample, node_idx: int, ) → dict[str, Any][source]#

Parse a metric sample into a standardized format.

Parameters:

sample – Prometheus metric sample
node_idx – Index of the node

Returns:

Dictionary with metric name and value

_parse_gpu_sku( sample: prometheus_client.samples.Sample, node_idx: int, ) → dict[str, str][source]#

Parse a GPU metric sample into a standardized format.

Parameters:

sample – Prometheus metric sample
node_idx – Index of the node

Returns:

Dictionary with metric name and value

_collect_gpu_sku() → dict[str, str][source]#

Collect GPU SKU from all Ray nodes.

Note: This is an internal API and users are not expected to call this.

Returns:

Dictionary of SKU types on all Ray nodes

_collect_metrics() → dict[str, Any][source]#

Collect GPU metrics from all Ray nodes.

Returns:

Dictionary of collected metrics

_collect( metrics: bool = False, sku: bool = False, ) → dict[str, Any][source]#

Collect GPU metrics from all Ray nodes.

Returns:

Dictionary of collected metrics

_fetch_and_parse_metrics( node_idx: int, metric_address: str, parser_fn: Callable, )[source]#

Fetch metrics from a node and parse GPU metrics.

Parameters:

node_idx – Index of the node
metric_address – Address of the metrics endpoint
parser_fn – Callable used to parse the fetched metric samples

Returns:

Dictionary of GPU metrics

flush() → None[source]#: Flush collected metrics to the parent logger.

class nemo_rl.utils.logger.MLflowLogger( cfg: nemo_rl.utils.logger.MLflowConfig, log_dir: Optional[str] = None, )[source]#

Bases: nemo_rl.utils.logger.LoggerInterface

MLflow logger backend.

Initialization

Initialize MLflow logger.

Parameters:

cfg – MLflow configuration
log_dir – Optional log directory

log_metrics( metrics: dict[str, Any], step: int, prefix: Optional[str] = '', step_metric: Optional[str] = None, ) → None[source]#

Log metrics to MLflow.

Parameters:

metrics – Dict of metrics to log
step – Global step value
prefix – Optional prefix for metric names
step_metric – Optional step metric name (ignored in MLflow)

log_hyperparams(params: Mapping[str, Any]) → None[source]#

Log hyperparameters to MLflow.

Parameters:

params – Dictionary of hyperparameters to log

log_plot( figure: matplotlib.pyplot.Figure, step: int, name: str, ) → None[source]#

Log a plot to MLflow.

Parameters:

figure – Matplotlib figure to log
step – Global step value
name – Name of the plot

__del__() → None[source]#: Clean up resources when the logger is destroyed.

class nemo_rl.utils.logger.Logger(cfg: nemo_rl.utils.logger.LoggerConfig)[source]#

Bases: nemo_rl.utils.logger.LoggerInterface

Main logger class that delegates to multiple backend loggers.

Initialization

Initialize the logger.

Parameters:

cfg –

Config dict with the following keys:

wandb_enabled
tensorboard_enabled
mlflow_enabled
wandb
tensorboard
mlflow
monitor_gpus
gpu_monitoring (with collection_interval and flush_interval)

log_metrics( metrics: dict[str, Any], step: int, prefix: Optional[str] = '', step_metric: Optional[str] = None, ) → None[source]#

Log metrics to all enabled backends.

Parameters:

metrics – Dict of metrics to log
step – Global step value
prefix – Optional prefix for metric names
step_metric – Optional name of a field in metrics to use as step instead of the provided step value (currently only needed for wandb)
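The delegation described here amounts to a fan-out over the enabled backends. The sketch below uses hypothetical `ListBackend` and `FanOutLogger` classes to show the pattern in isolation; it is not the library's implementation and skips backend construction from the config.

```python
from typing import Any, Mapping, Optional


class ListBackend:
    """Hypothetical backend that records every call it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def log_metrics(self, metrics, step, prefix="", step_metric=None) -> None:
        self.calls.append(("metrics", dict(metrics), step, prefix, step_metric))

    def log_hyperparams(self, params) -> None:
        self.calls.append(("hyperparams", dict(params)))


class FanOutLogger:
    """Forwards each call unchanged to every backend it was given."""

    def __init__(self, backends) -> None:
        self.backends = list(backends)

    def log_metrics(
        self,
        metrics: dict[str, Any],
        step: int,
        prefix: Optional[str] = "",
        step_metric: Optional[str] = None,
    ) -> None:
        for backend in self.backends:
            backend.log_metrics(metrics, step, prefix, step_metric)

    def log_hyperparams(self, params: Mapping[str, Any]) -> None:
        for backend in self.backends:
            backend.log_hyperparams(params)
```

Because every backend shares the `LoggerInterface` signature, the caller does not need to know which backends are enabled.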

log_hyperparams(params: Mapping[str, Any]) → None[source]#

Log hyperparameters to all enabled backends.

Parameters:

params – Dict of hyperparameters to log

log_batched_dict_as_jsonl( to_log: nemo_rl.distributed.batched_data_dict.BatchedDataDict[Any] | dict[str, Any], filename: str, ) → None[source]#

Log a list of dictionaries to a JSONL file.

Parameters:

to_log – BatchedDataDict to log
filename – Filename to log to (within the log directory)
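To make the JSONL layout concrete, here is a minimal sketch that writes a batched dictionary (one list per key, all of equal length) as one JSON object per line. `write_batched_dict_as_jsonl` is a hypothetical helper: it handles plain Python lists only, whereas the real method accepts a `BatchedDataDict` and may serialize rows differently.

```python
import json
import os
from typing import Any


def write_batched_dict_as_jsonl(to_log: dict[str, list[Any]], path: str) -> None:
    """Write one JSON object per batch element (hypothetical helper)."""
    lengths = {len(values) for values in to_log.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same batch size")
    batch_size = lengths.pop() if lengths else 0

    # Create the parent directory if the path includes one.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for i in range(batch_size):
            # Row i takes the i-th entry of every column.
            row = {key: values[i] for key, values in to_log.items()}
            f.write(json.dumps(row) + "\n")
```

Each line of the resulting file can be parsed independently with `json.loads`, which is what makes JSONL convenient for streaming through logged samples.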

log_plot_token_mult_prob_error( data: dict[str, Any], step: int, name: str, ) → None[source]#

Log a plot of log probability errors in samples.

This function logs & plots the per-token log-probabilities and errors over the sequence for the sample with the highest multiplicative probability error in the batch.

Parameters:

data – Dictionary of log probability samples
step – Global step value
name – Name of the plot

__del__() → None[source]#: Clean up resources when the logger is destroyed.

nemo_rl.utils.logger.flatten_dict( d: Mapping[str, Any], sep: str = '.', ) → dict[str, Any][source]#

Flatten a nested dictionary.

Handles nested dictionaries and lists by creating keys with separators. For lists, the index is used as part of the key.

Parameters:

d – Dictionary to flatten
sep – Separator to use between nested keys

Returns:

Flattened dictionary with compound keys

Examples

>>> from nemo_rl.utils.logger import flatten_dict
>>> flatten_dict({"a": 1, "b": {"c": 2}})
{'a': 1, 'b.c': 2}

>>> flatten_dict({"a": [1, 2], "b": {"c": [3, 4]}})
{'a.0': 1, 'a.1': 2, 'b.c.0': 3, 'b.c.1': 4}

>>> flatten_dict({"a": [{"b": 1}, {"c": 2}]})
{'a.0.b': 1, 'a.1.c': 2}
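For readers who want to see how such flattening can work, the following is a minimal re-implementation that reproduces the three examples above. `flatten_dict_sketch` is written for illustration and is not the library source; in particular, how the real function treats empty containers or non-list sequences is not covered by the documentation, and this sketch simply drops empty dicts and lists.

```python
from typing import Any, Mapping, Optional


def flatten_dict_sketch(d: Mapping[str, Any], sep: str = ".") -> dict[str, Any]:
    """Flatten nested mappings and lists into compound keys (illustrative only)."""
    flat: dict[str, Any] = {}

    def _walk(value: Any, key: Optional[str]) -> None:
        if isinstance(value, Mapping):
            items = value.items()
        elif isinstance(value, list):
            items = enumerate(value)  # list indices become key components
        else:
            flat[key] = value  # leaf value: store under the compound key
            return
        for child_key, child in items:
            name = str(child_key) if key is None else f"{key}{sep}{child_key}"
            _walk(child, name)

    _walk(d, None)
    return flat
```

Flat keys of this form are convenient for backends that only accept scalar values per metric or hyperparameter name.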

nemo_rl.utils.logger.configure_rich_logging( level: str = 'INFO', show_time: bool = True, show_path: bool = True, ) → None[source]#

Configure rich logging for more visually appealing log output.

Parameters:

level – The logging level to use
show_time – Whether to show timestamps in logs
show_path – Whether to show file paths in logs

nemo_rl.utils.logger.print_message_log_samples( message_logs: list[nemo_rl.data.interfaces.LLMMessageLogType], rewards: list[float], num_samples: int = 5, step: int = 0, ) → None[source]#

Print sampled message logs and their rewards in a horizontal layout with emoji indicators.

Parameters:

message_logs – List of message logs to sample from
rewards – List of rewards corresponding to each message log
num_samples – Number of samples to display (default: 5)
step – Current training step (for display purposes)

nemo_rl.utils.logger.get_next_experiment_dir(base_log_dir: str) → str[source]#

Create a new experiment directory with an incremented ID.

Parameters:

base_log_dir (str) – The base log directory path

Returns:

Path to the new experiment directory with incremented ID

Return type:

str
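One way to implement an incrementing experiment directory is sketched below. The `exp_NNN` naming scheme and the function name `next_experiment_dir_sketch` are assumptions for this example; the documentation above does not state the exact directory format the library uses.

```python
import os
import re


def next_experiment_dir_sketch(base_log_dir: str) -> str:
    """Create and return base_log_dir/exp_NNN with the next free ID (illustrative)."""
    pattern = re.compile(r"^exp_(\d+)$")  # assumed naming scheme
    os.makedirs(base_log_dir, exist_ok=True)

    # Collect the numeric IDs of existing experiment directories.
    existing_ids = []
    for name in os.listdir(base_log_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(base_log_dir, name)):
            existing_ids.append(int(match.group(1)))

    next_id = max(existing_ids, default=0) + 1
    new_dir = os.path.join(base_log_dir, f"exp_{next_id:03d}")
    os.makedirs(new_dir)
    return new_dir
```

Note that this sketch is not safe against two processes racing for the same ID; `os.makedirs` would raise `FileExistsError` for the loser.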
