PhysicsNeMo Launch Logging
The PhysicsNeMo Launch Logging module provides a comprehensive and flexible logging system for machine learning experiments and physics simulations. It offers multiple logging backends including console output, MLflow, and Weights & Biases (W&B), allowing users to track metrics, artifacts, and experiment parameters across different platforms. The module is designed to work seamlessly in both single-process and distributed training environments.
Key Features:
- Unified logging interface across different backends
- Support for distributed training environments
- Automatic metric aggregation and synchronization
- Flexible configuration and customization options
- Integration with popular experiment tracking platforms
Consider the following example usage:
from physicsnemo.launch.logging import LaunchLogger

# Initialize the logging singleton
LaunchLogger.initialize(use_mlflow=True)

# Training loop
for epoch in range(num_epochs):
    # Training logger
    with LaunchLogger(
        "train", epoch=epoch, num_mini_batch=len(training_datapipe), epoch_alert_freq=1
    ) as logger:
        for batch in training_datapipe:
            ...  # training code
            logger.log_minibatch({"train_loss": training_loss})

    # Validation logger
    with LaunchLogger(
        "val", epoch=epoch, num_mini_batch=len(validation_datapipe), epoch_alert_freq=1
    ) as logger:
        for batch in validation_datapipe:
            ...  # validation code
            logger.log_minibatch({"val_loss": validation_loss})

        learning_rate = ...  # get the learning rate at the end of the epoch from the optimizer
        logger.log_epoch({"learning_rate": learning_rate})  # log the learning rate at the end of the epoch
This example shows how to use the LaunchLogger to log metrics during training and validation. The logging singleton is initialized with the MLflow backend, and in each epoch a separate LaunchLogger context is created for training and for validation. The .log_minibatch method logs metrics on each mini-batch iteration, and the .log_epoch method logs epoch-level values such as the learning rate at the end of the epoch.
For a more detailed example, please refer to the Logging and Checkpointing recipe.
The LaunchLogger serves as the primary interface for logging in PhysicsNeMo. It provides a unified API that works consistently across different logging backends and training environments. The logger automatically handles metric aggregation in distributed settings and ensures proper synchronization across processes.
- class physicsnemo.launch.logging.launch.LaunchLogger(name_space, *args, **kwargs)[source]
Bases:
object
PhysicsNeMo Launch logger
An abstracted logger class that takes care of several fundamental logging functions. This class should first be initialized and then used via a context manager, which will automatically compute epoch metrics. This is the standard logger for PhysicsNeMo examples.
- Parameters
name_space (str) – Namespace of the logger. This defines the logger's title in the console and the wandb group in which the metric is plotted
epoch (int, optional) – Current epoch, by default 1
num_mini_batch (Union[int, None], optional) – Number of mini-batches used to calculate the epochs progress, by default None
profile (bool, optional) – Profile code using nvtx markers, by default False
mini_batch_log_freq (int, optional) – Frequency to log mini-batch losses, by default 100
epoch_alert_freq (Union[int, None], optional) – Epoch frequency to send training alert, by default None
Example
>>> from physicsnemo.launch.logging import LaunchLogger
>>> LaunchLogger.initialize()
>>> epochs = 3
>>> for i in range(epochs):
...     with LaunchLogger("Train", epoch=i) as log:
...         # Log 3 mini-batches manually
...         log.log_minibatch({"loss": 1.0})
...         log.log_minibatch({"loss": 2.0})
...         log.log_minibatch({"loss": 3.0})
- static initialize(use_wandb: bool = False, use_mlflow: bool = False)[source]
Initialize logging singleton
- Parameters
use_wandb (bool, optional) – Use WandB logging, by default False
use_mlflow (bool, optional) – Use MLFlow logging, by default False
- log_epoch(losses: Dict[str, float])[source]
Logs metrics for a single epoch
- Parameters
losses (Dict[str, float]) – Dictionary of metrics/loss values to log
- log_figure(figure, artifact_file: str = 'artifact', plot_dir: str = './', log_to_file: bool = False)[source]
Logs a figure on the root process to wandb or mlflow. If neither is enabled, the figure is stored to a file instead.
- Parameters
figure (Figure) – matplotlib or plotly figure to plot
artifact_file (str, optional) – File name. CAUTION: overwrites old files of the same name
plot_dir (str, optional) – output directory for plot
log_to_file (bool, optional) – Set to True to store the figure to a file in addition to logging it to mlflow/wandb
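A minimal sketch of logging a figure from within a LaunchLogger context, following the training example above (the figure contents and file names here are illustrative):
import matplotlib.pyplot as plt
from physicsnemo.launch.logging import LaunchLogger

LaunchLogger.initialize(use_mlflow=True)

with LaunchLogger("val", epoch=1) as logger:
    # Build an illustrative prediction-vs-target figure
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0.1, 0.9, 2.1], label="prediction")
    ax.plot([0, 1, 2], [0.0, 1.0, 2.0], label="target")
    ax.legend()
    # Log the figure; it is also written to plot_dir since log_to_file=True
    logger.log_figure(fig, artifact_file="val_plot.png", plot_dir="./", log_to_file=True)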
- log_minibatch(losses: Dict[str, float])[source]
Logs metrics for a mini-batch iteration
This function should be called every mini-batch iteration. It accumulates loss values over a datapipe; at the end of an epoch, the average of these losses over all mini-batches is calculated.
- Parameters
losses (Dict[str, float]) – Dictionary of metrics/loss values to log
- classmethod toggle_mlflow(value: bool)[source]
Toggle MLFlow logging
- Parameters
value (bool) – Use MLFlow logging
- classmethod toggle_wandb(value: bool)[source]
Toggle WandB logging
- Parameters
value (bool) – Use WandB logging
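For example, the toggles can be used to temporarily silence a backend after initialization (a minimal sketch):
from physicsnemo.launch.logging import LaunchLogger

LaunchLogger.initialize(use_wandb=True)

# Temporarily disable W&B logging, e.g. for a quick debugging run
LaunchLogger.toggle_wandb(False)
# ... run epochs without W&B logging ...
LaunchLogger.toggle_wandb(True)  # re-enable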
A simple but powerful console-based logger that provides formatted output to the terminal. It’s particularly useful during development and debugging, offering clear visibility into training progress and metrics.
- class physicsnemo.launch.logging.console.PythonLogger(name: str = 'launch')[source]
Bases:
object
Simple console logger for DL training. This is a WIP.
- error(message: str)[source]
Log error
- file_logging(file_name: str = 'launch.log')[source]
Log to file
- info(message: str)[source]
Log info
- log(message: str)[source]
Log message
- success(message: str)[source]
Log success
- warning(message: str)[source]
Log warning
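A short usage sketch of the console logger:
from physicsnemo.launch.logging.console import PythonLogger

logger = PythonLogger("main")  # name shown in the console output
logger.file_logging("launch.log")  # also mirror messages to a file
logger.info("Starting training")
logger.warning("Checkpoint directory already exists")
logger.success("Training complete")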
- class physicsnemo.launch.logging.console.RankZeroLoggingWrapper(obj, dist)[source]
Bases:
object
Wrapper class to only log from rank 0 process in distributed training.
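A sketch of rank-zero-only logging; this assumes PhysicsNeMo's DistributedManager (from physicsnemo.distributed) as the dist object passed to the wrapper:
from physicsnemo.distributed import DistributedManager
from physicsnemo.launch.logging.console import PythonLogger, RankZeroLoggingWrapper

DistributedManager.initialize()
dist = DistributedManager()

logger = PythonLogger("train")
rank_zero_logger = RankZeroLoggingWrapper(logger, dist)

# Printed only on the rank 0 process; other ranks stay silent
rank_zero_logger.info("Epoch 0 started")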
Integration with MLflow for experiment tracking and model management. This utility enables systematic tracking of experiments, including metrics, parameters, artifacts, and model versions. It’s particularly useful for teams that need to maintain reproducibility and compare different experiments. Users should initialize the MLflow backend before using the LaunchLogger.
- physicsnemo.launch.logging.mlflow.check_mlflow_logged_in(client: MlflowClient)[source]
Checks whether the MLFlow URI is functioning
This isn't the best solution right now and overrides the HTTP timeout. Can be updated if MLFlow use increases.
- physicsnemo.launch.logging.mlflow.initialize_mlflow(experiment_name: str, experiment_desc: str = None, run_name: str = None, run_desc: str = None, user_name: str = None, mode: Literal['offline', 'online', 'ngc'] = 'offline', tracking_location: str = None, artifact_location: str = None) → Tuple[MlflowClient, Run][source]
Initializes MLFlow logging client and run.
- Parameters
experiment_name (str) – Experiment name
experiment_desc (str, optional) – Experiment description, by default None
run_name (str, optional) – Run name, by default None
run_desc (str, optional) – Run description, by default None
user_name (str, optional) – User name, by default None
mode (str, optional) – MLFlow mode. Supports "offline", "online" and "ngc". Offline mode records logs to the local file system. Online mode is for remote tracking servers. NGC is a standardized setup specific to NGC runs, by default "offline"
tracking_location (str, optional) – Tracking location for MLFlow. For offline this would be an absolute folder directory. For online mode this would be a http URI or databricks. For NGC, this option is ignored, by default “/<run directory>/mlruns”
artifact_location (str, optional) – Optional separate artifact location, by default None
Note: For NGC mode, one needs to mount an NGC workspace / folder system with a metric folder at /mlflow/mlflow_metrics/ and an artifact folder at /mlflow/mlflow_artifacts/.
Note: This will set up the PhysicsNeMo Launch logger for MLFlow logging. Only one MLFlow logging client is supported with the PhysicsNeMo Launch logger.
- Returns
Returns MLFlow logging client and active run object
- Return type
Tuple[MlflowClient, Run]
Example usage:
from physicsnemo.launch.logging.mlflow import initialize_mlflow
from physicsnemo.launch.logging import LaunchLogger

# Initialize MLflow
initialize_mlflow(
    experiment_name="weather_prediction",
    user_name="physicsnemo_user",
    mode="offline",
)

# Enable the MLflow backend in the LaunchLogger
LaunchLogger.initialize(use_mlflow=True)
Integration with Weights & Biases (W&B) for experiment tracking and visualization. This utility provides rich visualization capabilities and easy experiment comparison, making it ideal for projects that require detailed analysis of training runs and model performance. Users should initialize the W&B backend before using the LaunchLogger.
Weights and Biases Routines and Utilities
- physicsnemo.launch.logging.wandb.alert(title, text, duration=300, level=0, is_master=True)[source]
Send alert.
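A sketch of sending a W&B alert when validation diverges, using the signature above; the threshold and loss values are illustrative, and delivery presumably requires an online, initialized run:
from physicsnemo.launch.logging.wandb import alert, initialize_wandb

initialize_wandb(project="physics_simulation", entity="my_team", mode="online")

validation_loss, epoch = 12.5, 3  # illustrative values
if validation_loss > 10.0:
    # Notify the team that the validation loss crossed the threshold
    alert(
        title="Validation loss diverging",
        text=f"val_loss={validation_loss:.3f} at epoch {epoch}",
    )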
- physicsnemo.launch.logging.wandb.initialize_wandb(project: str, entity: str, name: str = 'train', group: str = None, sync_tensorboard: bool = False, save_code: bool = False, resume: str = None, wandb_id: str = None, config=None, mode: Literal['offline', 'online', 'disabled'] = 'offline', results_dir: str = None)[source]
Function to initialize wandb client with the weights and biases server.
- Parameters
project (str) – Name of the project to sync data with
entity (str) – Name of the wandb entity
sync_tensorboard (bool, optional) – sync tensorboard summary writer with wandb, by default False
save_code (bool, optional) – Whether to push a copy of the code to wandb dashboard, by default False
name (str, optional) – Name of the task running, by default “train”
group (str, optional) – Group name of the task running. Good to set for ddp runs, by default None
resume (str, optional) – Sets the resuming behavior. Options: “allow”, “must”, “never”, “auto” or None, by default None.
wandb_id (str, optional) – A unique ID for this run, used for resuming. Used in conjunction with resume parameter to enable experiment resuming. See W&B documentation for more details: https://docs.wandb.ai/guides/runs/resuming/
config (optional) – A dictionary-like object for saving inputs, like hyperparameters. If dict, argparse or absl.flags, it will load the key value pairs into the wandb.config object. If str, it will look for a yaml file by that name, by default None.
mode (str, optional) – Can be “offline”, “online” or “disabled”, by default “offline”
results_dir (str, optional) – Output directory of the experiment, by default “/<run directory>/wandb”
- physicsnemo.launch.logging.wandb.is_wandb_initialized()[source]
Check if wandb has been initialized.
Example usage:
from physicsnemo.launch.logging.wandb import initialize_wandb
from physicsnemo.launch.logging import LaunchLogger

# Initialize W&B
initialize_wandb(
    project="physics_simulation",
    entity="my_team",
)

# Enable the W&B backend in the LaunchLogger
LaunchLogger.initialize(use_wandb=True)
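is_wandb_initialized is useful as a guard before wandb-specific calls, for example:
from physicsnemo.launch.logging.wandb import initialize_wandb, is_wandb_initialized

# Avoid re-initializing the client if it is already set up
if not is_wandb_initialized():
    initialize_wandb(project="physics_simulation", entity="my_team")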
Utility functions and helpers for logging operations.
- physicsnemo.launch.logging.utils.create_ddp_group_tag(group_name: str = None) → str[source]
Creates a common group tag for logging
For some reason this does not work with multi-node. There seems to be a bug in PyTorch when one uses a distributed utility before DDP
- Parameters
group_name (str, optional) – Optional group name prefix. If None, will use "DDP_Group_", by default None
- Returns
Group tag
- Return type
str
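A sketch of combining the group tag with initialize_wandb so per-rank runs are grouped together in the W&B dashboard (single-node only, per the note above):
from physicsnemo.launch.logging.utils import create_ddp_group_tag
from physicsnemo.launch.logging.wandb import initialize_wandb

# Every rank receives a common tag prefixed with "DDP_Group_"
group_tag = create_ddp_group_tag()
initialize_wandb(
    project="physics_simulation",
    entity="my_team",
    group=group_tag,
)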