NVIDIA Modulus Launch v0.2.0

Modulus Launch Logging

class modulus.launch.logging.launch.LaunchLogger(name_space, *args, **kwargs)[source]

Bases: object

Modulus Launch logger

An abstracted logger class that takes care of several fundamental logging functions. This class should first be initialized and then used via a context manager. This will auto compute epoch metrics. This is the standard logger for Modulus examples.

Parameters
  • name_space (str) – Namespace of the logger to use. This defines the logger's title in the console and the WandB group the metric is plotted in

  • epoch (int, optional) – Current epoch, by default 1

  • num_mini_batch (Union[int, None], optional) – Number of mini-batches used to calculate the epochs progress, by default None

  • profile (bool, optional) – Profile code using nvtx markers, by default False

  • mini_batch_log_freq (int, optional) – Frequency to log mini-batch losses, by default 100

  • epoch_alert_freq (Union[int, None], optional) – Epoch frequency to send training alert, by default None

Example

>>> from modulus.launch.logging import LaunchLogger
>>> LaunchLogger.initialize()
>>> epochs = 3
>>> for i in range(epochs):
...     with LaunchLogger("Train", epoch=i) as log:
...         # Log 3 mini-batches manually
...         log.log_minibatch({"loss": 1.0})
...         log.log_minibatch({"loss": 2.0})
...         log.log_minibatch({"loss": 3.0})

static initialize(use_wandb: bool = False, use_mlflow: bool = False)[source]

Initialize logging singleton

Parameters
  • use_wandb (bool, optional) – Use WandB logging, by default False

  • use_mlflow (bool, optional) – Use MLFlow logging, by default False
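
Example

A minimal sketch of enabling a backend at initialization; enabling WandB assumes a client has already been configured via initialize_wandb (documented below):

>>> from modulus.launch.logging import LaunchLogger
>>> LaunchLogger.initialize(use_wandb=True)  # mirror metrics to WandB
>>> LaunchLogger.toggle_wandb(False)  # backends can also be toggled later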

log_epoch(losses: Dict[str, float])[source]

Logs metrics for a single epoch

Parameters

losses (Dict[str, float]) – Dictionary of metrics/loss values to log
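
Example

A minimal sketch of recording a once-per-epoch value, such as the learning rate, from inside the context manager:

>>> with LaunchLogger("Train", epoch=1) as log:
...     log.log_epoch({"learning_rate": 1e-3})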

log_minibatch(losses: Dict[str, float])[source]

Logs metrics for a single mini-batch iteration

This function should be called every mini-batch iteration. It accumulates loss values over the datapipe. At the end of an epoch, the average of these losses over all mini-batches is calculated.

Parameters

losses (Dict[str, float]) – Dictionary of metrics/loss values to log

classmethod toggle_mlflow(value: bool)[source]

Toggle MLFlow logging

Parameters

value (bool) – Use MLFlow logging

classmethod toggle_wandb(value: bool)[source]

Toggle WandB logging

Parameters

value (bool) – Use WandB logging

class modulus.launch.logging.console.PythonLogger(name: str = 'launch')[source]

Bases: object

Simple console logger for deep learning training. This is a work in progress.

error(message: str)[source]

Log error

file_logging(file_name: str = 'launch.log')[source]

Log to file

info(message: str)[source]

Log info

log(message: str)[source]

Log message

success(message: str)[source]

Log success

warning(message: str)[source]

Log warning
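
Example

A minimal sketch of typical usage, using the methods listed above; the file name passed to file_logging is an illustrative placeholder:

>>> from modulus.launch.logging.console import PythonLogger
>>> logger = PythonLogger("train")
>>> logger.file_logging("train.log")  # also write records to this file
>>> logger.info("Starting training")
>>> logger.warning("Checkpoint not found, training from scratch")
>>> logger.success("Training complete")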

class modulus.launch.logging.console.RankZeroLoggingWrapper(obj, dist)[source]

Bases: object

Wrapper class to only log from rank 0 process in distributed training.
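
Example

A minimal sketch of wrapping a PythonLogger so messages are emitted only on rank 0; it assumes the dist object is a DistributedManager from modulus.distributed:

>>> from modulus.distributed import DistributedManager
>>> from modulus.launch.logging.console import PythonLogger, RankZeroLoggingWrapper
>>> DistributedManager.initialize()
>>> dist = DistributedManager()
>>> logger = PythonLogger("main")
>>> rank_zero_logger = RankZeroLoggingWrapper(logger, dist)
>>> rank_zero_logger.info("Printed only by the rank 0 process")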

modulus.launch.logging.mlflow.check_mlflow_logged_in(client: MlflowClient)[source]

Checks to see if MLFlow URI is functioning

This is not an ideal solution at present, since it overrides the HTTP timeout. It can be updated if MLFlow use increases.

modulus.launch.logging.mlflow.initialize_mlflow(experiment_name: str, experiment_desc: Optional[str] = None, run_name: Optional[str] = None, run_desc: Optional[str] = None, user_name: Optional[str] = None, mode: Literal['offline', 'online', 'ngc'] = 'offline', tracking_location: Optional[str] = None, artifact_location: Optional[str] = None) → Tuple[MlflowClient, Run][source]

Initializes MLFlow logging client and run.

Parameters
  • experiment_name (str) – Experiment name

  • experiment_desc (str, optional) – Experiment description, by default None

  • run_name (str, optional) – Run name, by default None

  • run_desc (str, optional) – Run description, by default None

  • user_name (str, optional) – User name, by default None

  • mode (str, optional) – MLFlow mode. Supports “offline”, “online” and “ngc”. Offline mode records logs to the local file system. Online mode is for remote tracking servers. NGC mode is a standardized setup specific to NGC runs, by default “offline”

  • tracking_location (str, optional) – Tracking location for MLFlow. For offline mode this should be an absolute path to a folder. For online mode this should be an HTTP URI or databricks. For NGC, this option is ignored, by default “/<run directory>/mlruns”

  • artifact_location (str, optional) – Optional separate artifact location, by default None

Note

For NGC mode, one needs to mount an NGC workspace / folder system with a metric folder at /mlflow/mlflow_metrics/ and an artifact folder at /mlflow/mlflow_artifacts/.

Note

This will set up Modulus Launch logger for MLFlow logging. Only one MLFlow logging client is supported with the Modulus Launch logger.

Returns

Returns MLFlow logging client and active run object

Return type

Tuple[MlflowClient, Run]
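
Example

A minimal sketch of initializing MLFlow and routing LaunchLogger metrics to it; the experiment and run names are illustrative placeholders:

>>> from modulus.launch.logging import LaunchLogger
>>> from modulus.launch.logging.mlflow import initialize_mlflow
>>> client, run = initialize_mlflow(
...     experiment_name="my_experiment",
...     run_name="baseline",
...     mode="offline",
... )
>>> LaunchLogger.initialize(use_mlflow=True)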

Weights and Biases Routines and Utilities

modulus.launch.logging.wandb.alert(title, text, duration=300, level=0, is_master=True)[source]

Send alert.

modulus.launch.logging.wandb.initialize_wandb(project: str, entity: str, name: str = 'train', group: Optional[str] = None, sync_tensorboard: bool = False, save_code: bool = False, resume: Optional[str] = None, config=None, mode: Literal['offline', 'online', 'disabled'] = 'offline', results_dir: Optional[str] = None)[source]

Function to initialize wandb client with the weights and biases server.

Parameters
  • project (str) – Name of the project to sync data with

  • entity (str) – Name of the wandb entity

  • sync_tensorboard (bool, optional) – sync tensorboard summary writer with wandb, by default False

  • save_code (bool, optional) – Whether to push a copy of the code to wandb dashboard, by default False

  • name (str, optional) – Name of the task running, by default “train”

  • group (str, optional) – Group name of the task running. Good to set for ddp runs, by default None

  • resume (str, optional) – Sets the resuming behavior. Options: “allow”, “must”, “never”, “auto” or None, by default None.

  • config (optional) – A dictionary-like object for saving inputs, such as hyperparameters. If a dict, argparse, or absl.flags object, it will load the key-value pairs into the wandb.config object. If a str, it will look for a YAML file by that name, by default None.

  • mode (str, optional) – Can be “offline”, “online” or “disabled”, by default “offline”

  • results_dir (str, optional) – Output directory of the experiment, by default “/<run directory>/wandb”
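
Example

A minimal sketch of initializing WandB and routing LaunchLogger metrics to it; the project and entity names are illustrative placeholders:

>>> from modulus.launch.logging import LaunchLogger
>>> from modulus.launch.logging.wandb import initialize_wandb
>>> initialize_wandb(
...     project="my_project",
...     entity="my_team",
...     name="train",
...     mode="offline",
... )
>>> LaunchLogger.initialize(use_wandb=True)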

modulus.launch.logging.wandb.is_wandb_initialized()[source]

Check if wandb has been initialized.

modulus.launch.logging.utils.create_ddp_group_tag(group_name: Optional[str] = None) → str[source]

Creates a common group tag for logging

For some reason this does not work with multi-node runs. There seems to be a bug in PyTorch when a distributed utility is used before DDP is initialized.

Parameters

group_name (str, optional) – Optional group name prefix. If None will use “DDP_Group_”, by default None

Returns

Group tag

Return type

str
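
Example

A minimal sketch of using the tag to group DDP processes under a single WandB group; the project and entity names are illustrative placeholders:

>>> from modulus.launch.logging.utils import create_ddp_group_tag
>>> from modulus.launch.logging.wandb import initialize_wandb
>>> tag = create_ddp_group_tag()  # uses the "DDP_Group_" prefix by default
>>> initialize_wandb(project="my_project", entity="my_team", group=tag)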
