PhysicsNeMo Launch Logging
The PhysicsNeMo Launch Logging module provides a comprehensive and flexible logging system for machine learning experiments and physics simulations. It offers multiple logging backends including console output, MLflow, and Weights & Biases (W&B), allowing users to track metrics, artifacts, and experiment parameters across different platforms. The module is designed to work seamlessly in both single-process and distributed training environments.
Key Features:
- Unified logging interface across different backends
- Support for distributed training environments
- Automatic metric aggregation and synchronization
- Flexible configuration and customization options
- Integration with popular experiment tracking platforms
Consider the following example usage:
from physicsnemo.launch.logging import LaunchLogger

# Initialize the logging singleton
LaunchLogger.initialize(use_mlflow=True)

# Training loop
for epoch in range(num_epochs):
    # Training logger
    with LaunchLogger(
        "train", epoch=epoch, num_mini_batch=len(training_datapipe), epoch_alert_freq=1
    ) as logger:
        for batch in training_datapipe:
            ...  # training code
            logger.log_minibatch({"train_loss": training_loss})

    # Validation logger
    with LaunchLogger(
        "val", epoch=epoch, num_mini_batch=len(validation_datapipe), epoch_alert_freq=1
    ) as logger:
        for batch in validation_datapipe:
            ...  # validation code
            logger.log_minibatch({"val_loss": validation_loss})

        learning_rate = ...  # get the learning rate at the end of the epoch from the optimizer
        logger.log_epoch({"learning_rate": learning_rate})  # log the learning rate at the end of the epoch
This example shows how to use the LaunchLogger to log metrics during training and validation. The logging singleton is initialized with the MLflow backend, and in each epoch a separate LaunchLogger context is created for training and for validation. The .log_minibatch method logs metrics on each mini-batch iteration, and the .log_epoch method logs epoch-level values such as the learning rate at the end of the epoch.
For a more detailed example, please refer to the Logging and Checkpointing recipe.
The LaunchLogger serves as the primary interface for logging in PhysicsNeMo. It provides a unified API that works consistently across different logging backends and training environments. The logger automatically handles metric aggregation in distributed settings and ensures proper synchronization across processes.
- class physicsnemo.launch.logging.launch.LaunchLogger(name_space, *args, **kwargs)[source]
Bases:
object
PhysicsNeMo Launch logger
An abstracted logger class that takes care of several fundamental logging functions. This class should first be initialized and then used via a context manager, which will automatically compute epoch metrics. This is the standard logger for PhysicsNeMo examples.
- Parameters
name_space (str) – Namespace of the logger. This defines the logger's title in the console and the wandb group in which the metric is plotted
epoch (int, optional) – Current epoch, by default 1
num_mini_batch (Union[int, None], optional) – Number of mini-batches used to calculate the epochs progress, by default None
profile (bool, optional) – Profile code using nvtx markers, by default False
mini_batch_log_freq (int, optional) – Frequency to log mini-batch losses, by default 100
epoch_alert_freq (Union[int, None], optional) – Epoch frequency to send training alert, by default None
Example
>>> from physicsnemo.launch.logging import LaunchLogger
>>> LaunchLogger.initialize()
>>> epochs = 3
>>> for i in range(epochs):
...     with LaunchLogger("Train", epoch=i) as log:
...         # Log 3 mini-batches manually
...         log.log_minibatch({"loss": 1.0})
...         log.log_minibatch({"loss": 2.0})
...         log.log_minibatch({"loss": 3.0})
- static initialize(use_wandb: bool = False, use_mlflow: bool = False)[source]
Initialize logging singleton
- Parameters
use_wandb (bool, optional) – Use WandB logging, by default False
use_mlflow (bool, optional) – Use MLFlow logging, by default False
- log_epoch(losses: Dict[str, float])[source]
Logs metrics for a single epoch
- Parameters
losses (Dict[str, float]) – Dictionary of metrics/loss values to log
- log_figure(figure, artifact_file: str = 'artifact', plot_dir: str = './', log_to_file: bool = False)[source]
Logs a figure on the root process to wandb or mlflow. If neither is enabled, the figure is stored to a file instead.
- Parameters
figure (Figure) – matplotlib or plotly figure to plot
artifact_file (str, optional) – File name. CAUTION: overwrites old files of the same name
plot_dir (str, optional) – output directory for plot
log_to_file (bool, optional) – Set to True to store the figure to a file in addition to logging it to mlflow/wandb
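A minimal sketch of logging a figure from within a LaunchLogger context, following the training example above (the figure contents and file names here are illustrative):
import matplotlib.pyplot as plt
from physicsnemo.launch.logging import LaunchLogger

LaunchLogger.initialize(use_mlflow=True)

with LaunchLogger("val", epoch=1) as logger:
    # Build an illustrative prediction-vs-target figure
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0.1, 0.9, 2.1], label="prediction")
    ax.plot([0, 1, 2], [0.0, 1.0, 2.0], label="target")
    ax.legend()
    # Log the figure; it is also written to plot_dir since log_to_file=True
    logger.log_figure(fig, artifact_file="val_plot.png", plot_dir="./", log_to_file=True)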
- log_minibatch(losses: Dict[str, float])[source]
Logs metrics for a mini-batch iteration
This function should be called every mini-batch iteration. It accumulates loss values over a datapipe; at the end of an epoch, the average of these losses over all mini-batches is calculated.
- Parameters
losses (Dict[str, float]) – Dictionary of metrics/loss values to log
- classmethod toggle_mlflow(value: bool)[source]
Toggle MLFlow logging
- Parameters
value (bool) – Use MLFlow logging
- classmethod toggle_wandb(value: bool)[source]
Toggle WandB logging
- Parameters
value (bool) – Use WandB logging
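For example, the toggles can be used to temporarily silence a backend after initialization (a minimal sketch):
from physicsnemo.launch.logging import LaunchLogger

LaunchLogger.initialize(use_wandb=True)

# Temporarily disable W&B logging, e.g. for a quick debugging run
LaunchLogger.toggle_wandb(False)
# ... run epochs without W&B logging ...
LaunchLogger.toggle_wandb(True)  # re-enable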
A simple but powerful console-based logger that provides formatted output to the terminal. It’s particularly useful during development and debugging, offering clear visibility into training progress and metrics.
- class physicsnemo.launch.logging.console.PythonLogger(name: str = 'launch')[source]
Bases:
object
Simple console logger for DL training. This is a WIP.
- error(message: str)[source]
Log error
- file_logging(file_name: str = 'launch.log')[source]
Log to file
- info(message: str)[source]
Log info
- log(message: str)[source]
Log message
- success(message: str)[source]
Log success
- warning(message: str)[source]
Log warning
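A short usage sketch of the console logger:
from physicsnemo.launch.logging.console import PythonLogger

logger = PythonLogger("main")  # name shown in the console output
logger.file_logging("launch.log")  # also mirror messages to a file
logger.info("Starting training")
logger.warning("Checkpoint directory already exists")
logger.success("Training complete")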
- class physicsnemo.launch.logging.console.RankZeroLoggingWrapper(obj, dist)[source]
Bases:
object
Wrapper class to only log from rank 0 process in distributed training.
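A sketch of rank-zero-only logging; this assumes PhysicsNeMo's DistributedManager (from physicsnemo.distributed) as the dist object passed to the wrapper:
from physicsnemo.distributed import DistributedManager
from physicsnemo.launch.logging.console import PythonLogger, RankZeroLoggingWrapper

DistributedManager.initialize()
dist = DistributedManager()

logger = PythonLogger("train")
rank_zero_logger = RankZeroLoggingWrapper(logger, dist)

# Printed only on the rank 0 process; other ranks stay silent
rank_zero_logger.info("Epoch 0 started")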
Integration with MLflow for experiment tracking and model management. This utility enables systematic tracking of experiments, including metrics, parameters, artifacts, and model versions. It’s particularly useful for teams that need to maintain reproducibility and compare different experiments. Users should initialize the MLflow backend before using the LaunchLogger.
- physicsnemo.launch.logging.mlflow.check_mlflow_logged_in(client: MlflowClient)[source]
Checks whether the MLFlow URI is functioning
This isn't the best solution right now and overrides the HTTP timeout. Can be updated if MLFlow use increases.
- physicsnemo.launch.logging.mlflow.initialize_mlflow(experiment_name: str, experiment_desc: str = None, run_name: str = None, run_desc: str = None, user_name: str = None, mode: Literal['offline', 'online', 'ngc'] = 'offline', tracking_location: str = None, artifact_location: str = None) → Tuple[MlflowClient, Run][source]
Initializes MLFlow logging client and run.
- Parameters
experiment_name (str) – Experiment name
experiment_desc (str, optional) – Experiment description, by default None
run_name (str, optional) – Run name, by default None
run_desc (str, optional) – Run description, by default None
user_name (str, optional) – User name, by default None
mode (str, optional) – MLFlow mode. Supports "offline", "online" and "ngc". Offline mode records logs to the local file system. Online mode is for remote tracking servers. NGC is a standardized setup specific to NGC runs, by default "offline"
tracking_location (str, optional) – Tracking location for MLFlow. For offline this would be an absolute folder directory. For online mode this would be a http URI or databricks. For NGC, this option is ignored, by default “/<run directory>/mlruns”
artifact_location (str, optional) – Optional separate artifact location, by default None
Note: For NGC mode, one needs to mount an NGC workspace / folder system with a metric folder at /mlflow/mlflow_metrics/ and an artifact folder at /mlflow/mlflow_artifacts/.
Note: This will set up the PhysicsNeMo Launch logger for MLFlow logging. Only one MLFlow logging client is supported with the PhysicsNeMo Launch logger.
- Returns
Returns MLFlow logging client and active run object
- Return type
Tuple[MlflowClient, Run]
Example usage:
from physicsnemo.launch.logging.mlflow import initialize_mlflow
from physicsnemo.launch.logging import LaunchLogger

# Initialize MLflow
initialize_mlflow(
    experiment_name="weather_prediction",
    user_name="physicsnemo_user",
    mode="offline",
)

# Enable the MLflow backend in the LaunchLogger
LaunchLogger.initialize(use_mlflow=True)
Integration with Weights & Biases (W&B) for experiment tracking and visualization. This utility provides rich visualization capabilities and easy experiment comparison, making it ideal for projects that require detailed analysis of training runs and model performance. Users should initialize the W&B backend before using the LaunchLogger.
Weights and Biases Routines and Utilities
- physicsnemo.launch.logging.wandb.alert(title, text, duration=300, level=0, is_master=True)[source]
Send alert.
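A sketch of sending a W&B alert when validation diverges, using the signature above; the threshold and loss values are illustrative, and delivery presumably requires an online, initialized run:
from physicsnemo.launch.logging.wandb import alert, initialize_wandb

initialize_wandb(project="physics_simulation", entity="my_team", mode="online")

validation_loss, epoch = 12.5, 3  # illustrative values
if validation_loss > 10.0:
    # Notify the team that the validation loss crossed the threshold
    alert(
        title="Validation loss diverging",
        text=f"val_loss={validation_loss:.3f} at epoch {epoch}",
    )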
- physicsnemo.launch.logging.wandb.initialize_wandb(project: str, entity: str, name: str = 'train', group: str = None, sync_tensorboard: bool = False, save_code: bool = False, resume: str = None, wandb_id: str = None, config=None, mode: Literal['offline', 'online', 'disabled'] = 'offline', results_dir: str = None)[source]
Function to initialize wandb client with the weights and biases server.
- Parameters
project (str) – Name of the project to sync data with
entity (str) – Name of the wandb entity
sync_tensorboard (bool, optional) – sync tensorboard summary writer with wandb, by default False
save_code (bool, optional) – Whether to push a copy of the code to wandb dashboard, by default False
name (str, optional) – Name of the task running, by default “train”
group (str, optional) – Group name of the task running. Good to set for ddp runs, by default None
resume (str, optional) – Sets the resuming behavior. Options: “allow”, “must”, “never”, “auto” or None, by default None.
wandb_id (str, optional) – A unique ID for this run, used for resuming. Used in conjunction with resume parameter to enable experiment resuming. See W&B documentation for more details: https://docs.wandb.ai/guides/runs/resuming/
config (optional) – A dictionary-like object for saving inputs, like hyperparameters. If dict, argparse or absl.flags, it will load the key value pairs into the wandb.config object. If str, it will look for a yaml file by that name, by default None.
mode (str, optional) – Can be “offline”, “online” or “disabled”, by default “offline”
results_dir (str, optional) – Output directory of the experiment, by default “/<run directory>/wandb”
- physicsnemo.launch.logging.wandb.is_wandb_initialized()[source]
Check if wandb has been initialized.
Example usage:
from physicsnemo.launch.logging.wandb import initialize_wandb
from physicsnemo.launch.logging import LaunchLogger

# Initialize W&B
initialize_wandb(
    project="physics_simulation",
    entity="my_team",
)

# Enable the W&B backend in the LaunchLogger
LaunchLogger.initialize(use_wandb=True)
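is_wandb_initialized is useful as a guard before wandb-specific calls, for example:
from physicsnemo.launch.logging.wandb import initialize_wandb, is_wandb_initialized

# Avoid re-initializing the client if it is already set up
if not is_wandb_initialized():
    initialize_wandb(project="physics_simulation", entity="my_team")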
Utility functions and helpers for logging operations.
- physicsnemo.launch.logging.utils.create_ddp_group_tag(group_name: str = None) → str[source]
Creates a common group tag for logging
For some reason this does not work with multi-node. There seems to be a bug in PyTorch when one uses a distributed utility before DDP
- Parameters
group_name (str, optional) – Optional group name prefix. If None, will use "DDP_Group_", by default None
- Returns
Group tag
- Return type
str
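A sketch of combining the group tag with initialize_wandb so per-rank runs are grouped together in the W&B dashboard (single-node only, per the note above):
from physicsnemo.launch.logging.utils import create_ddp_group_tag
from physicsnemo.launch.logging.wandb import initialize_wandb

# Every rank receives a common tag prefixed with "DDP_Group_"
group_tag = create_ddp_group_tag()
initialize_wandb(
    project="physics_simulation",
    entity="my_team",
    group=group_tag,
)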