Date: 2024-01-25

Modulus provides utilities to standardize the logs of different training runs. With the Modulus logging utilities, you can choose anything from plain console logging to more advanced ML experiment trackers such as MLFlow and Weights & Biases. You can always implement these loggers yourself, but in this example we will use the Modulus utilities, which both simplify the process and provide a standardized output format. Let's get started.

The example below shows a simple setup using console logging.

    import torch

    import modulus
    from modulus.datapipes.benchmarks.darcy import Darcy2D
    from modulus.launch.logging import LaunchLogger, PythonLogger
    from modulus.metrics.general.mse import mse
    from modulus.models.fno.fno import FNO

    normaliser = {
        "permeability": (1.25, 0.75),
        "darcy": (4.52e-2, 2.79e-2),
    }
    dataloader = Darcy2D(
        resolution=256, batch_size=64, nr_permeability_freq=5, normaliser=normaliser
    )
    model = FNO(
        in_channels=1,
        out_channels=1,
        decoder_layers=1,
        decoder_layer_size=32,
        dimension=2,
        latent_channels=32,
        num_fno_layers=4,
        num_fno_modes=12,
        padding=5,
    ).to("cuda")

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: 0.85**step
    )

    # Initialize the logger
    logger = PythonLogger("main")  # General python logger
    LaunchLogger.initialize()

    # Use logger methods to track various information during training
    logger.info("Starting Training!")

    # Train for 20 epochs, each epoch running for 5 iterations
    for i in range(20):
        # Wrap the epoch in LaunchLogger to control the frequency of console output
        with LaunchLogger("train", epoch=i) as launchlog:
            # These would be iterations through different batches
            for _ in range(5):
                batch = next(iter(dataloader))
                true = batch["darcy"]
                pred = model(batch["permeability"])
                loss = mse(pred, true)
                # Clear gradients from the previous iteration so they do not accumulate
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                scheduler.step()
                launchlog.log_minibatch({"Loss": loss.detach().cpu().numpy()})

            launchlog.log_epoch({"Learning Rate": optimizer.param_groups[0]["lr"]})

    logger.info("Finished Training!")

The logger output can be seen below.

    Warp 0.10.1 initialized:
       CUDA Toolkit: 11.5, Driver: 12.2
       Devices:
         "cpu"    | x86_64
         "cuda:0" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:1" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:2" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:3" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:4" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:5" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:6" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:7" | Tesla V100-SXM2-16GB-N (sm_70)
       Kernel cache: /root/.cache/warp/0.10.1
    /usr/local/lib/python3.10/dist-packages/pydantic/_internal/_fields.py:128: UserWarning: Field "model_server_url" has conflict with protected namespace "model_".

    You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
      warnings.warn(
    /usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:
    * 'schema_extra' has been renamed to 'json_schema_extra'
      warnings.warn(message, UserWarning)
    [21:23:57 - main - INFO] Starting Training!
    Module modulus.datapipes.benchmarks.kernels.initialization load on device 'cuda:0' took 73.06 ms
    Module modulus.datapipes.benchmarks.kernels.utils load on device 'cuda:0' took 314.91 ms
    Module modulus.datapipes.benchmarks.kernels.finite_difference load on device 'cuda:0' took 149.86 ms
    [21:24:02 - train - INFO] Epoch 0 Metrics: Learning Rate = 4.437e-03, Loss = 1.009e+00
    [21:24:02 - train - INFO] Epoch Execution Time: 5.664e+00s, Time/Iter: 1.133e+03ms
    [21:24:06 - train - INFO] Epoch 1 Metrics: Learning Rate = 1.969e-03, Loss = 6.040e-01
    [21:24:06 - train - INFO] Epoch Execution Time: 4.013e+00s, Time/Iter: 8.025e+02ms
    ...
    [21:25:32 - train - INFO] Epoch 19 Metrics: Learning Rate = 8.748e-10, Loss = 1.384e-01
    [21:25:32 - train - INFO] Epoch Execution Time: 4.010e+00s, Time/Iter: 8.020e+02ms
    [21:25:32 - main - INFO] Finished Training!
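
The per-epoch lines in this output can be understood with a small stand-alone sketch. It uses only Python's standard `logging` module, not Modulus. The class name `EpochLog` and its `summary` method are hypothetical, and the sketch assumes that minibatch metrics are averaged over the epoch before being reported. That is our reading of the output, not a statement about how `LaunchLogger` is implemented.

```python
import logging
import time


class EpochLog:
    """Hypothetical epoch-scoped logger; a sketch, not the Modulus LaunchLogger."""

    def __init__(self, name, epoch):
        self.logger = logging.getLogger(name)
        self.epoch = epoch
        self.sums = {}           # running sum of each minibatch metric
        self.counts = {}         # number of minibatches seen per metric
        self.epoch_metrics = {}  # values logged once per epoch

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def log_minibatch(self, metrics):
        # Accumulate values so they can be averaged at the end of the epoch.
        for key, value in metrics.items():
            self.sums[key] = self.sums.get(key, 0.0) + float(value)
            self.counts[key] = self.counts.get(key, 0) + 1

    def log_epoch(self, metrics):
        self.epoch_metrics.update(metrics)

    def summary(self):
        # Assumption: minibatch metrics are reported as their epoch average.
        out = {key: self.sums[key] / self.counts[key] for key in self.sums}
        out.update(self.epoch_metrics)
        return out

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self.start
        text = ", ".join(
            f"{key} = {value:.3e}" for key, value in sorted(self.summary().items())
        )
        self.logger.info("Epoch %d Metrics: %s", self.epoch, text)
        self.logger.info("Epoch Execution Time: %.3es", elapsed)
        return False  # do not swallow exceptions raised inside the block


# Format chosen to resemble the "[time - name - level] message" lines above.
logging.basicConfig(
    format="[%(asctime)s - %(name)s - %(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
    level=logging.INFO,
)

with EpochLog("train", epoch=0) as log:
    for loss in (1.0, 0.5, 0.3):  # stand-in losses, not real training values
        log.log_minibatch({"Loss": loss})
    log.log_epoch({"Learning Rate": 0.01})
```

Leaving the `with` block triggers one summary per epoch, which is why the console shows a single metrics line per epoch rather than one per iteration.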

The example below shows a simple setup using MLFlow logging. The only difference from the previous example is that we use the initialize_mlflow function to initialize the MLFlow client and set use_mlflow=True when initializing the LaunchLogger.

    import torch

    import modulus
    from modulus.datapipes.benchmarks.darcy import Darcy2D
    from modulus.launch.logging import LaunchLogger, PythonLogger, initialize_mlflow
    from modulus.metrics.general.mse import mse
    from modulus.models.fno.fno import FNO

    normaliser = {
        "permeability": (1.25, 0.75),
        "darcy": (4.52e-2, 2.79e-2),
    }
    dataloader = Darcy2D(
        resolution=256, batch_size=64, nr_permeability_freq=5, normaliser=normaliser
    )
    model = FNO(
        in_channels=1,
        out_channels=1,
        decoder_layers=1,
        decoder_layer_size=32,
        dimension=2,
        latent_channels=32,
        num_fno_layers=4,
        num_fno_modes=12,
        padding=5,
    ).to("cuda")

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: 0.85**step
    )

    # Initialize the console logger
    logger = PythonLogger("main")  # General python logger

    # Initialize the MLFlow logger
    initialize_mlflow(
        experiment_name="Modulus Tutorials",
        experiment_desc="Simple Modulus Tutorials",
        run_name="Modulus MLFLow Tutorial",
        run_desc="Modulus Tutorial Training",
        user_name="Modulus User",
        mode="offline",
    )
    LaunchLogger.initialize(use_mlflow=True)

    # Use logger methods to track various information during training
    logger.info("Starting Training!")

    # Train for 20 epochs, each epoch running for 5 iterations
    for i in range(20):
        # Wrap the epoch in LaunchLogger to control the frequency of console output
        with LaunchLogger("train", epoch=i) as launchlog:
            # These would be iterations through different batches
            for _ in range(5):
                batch = next(iter(dataloader))
                true = batch["darcy"]
                pred = model(batch["permeability"])
                loss = mse(pred, true)
                # Clear gradients from the previous iteration so they do not accumulate
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                scheduler.step()
                launchlog.log_minibatch({"Loss": loss.detach().cpu().numpy()})

            launchlog.log_epoch({"Learning Rate": optimizer.param_groups[0]["lr"]})

    logger.info("Finished Training!")

During the run, a directory named mlruns_0 is created to store the MLFlow logs. To visualize the logs interactively, run the following:

    mlflow ui --backend-store-uri mlruns_0/

Then navigate to localhost:5000 in your favorite browser.

Warning: Currently, the MLFlow logger logs the output of each processor separately, so multi-processor runs create multiple directories. This is a known issue and will be fixed in a future release.

The example below shows a simple setup using Weights and Biases logging. The only difference from the previous example is that we use the initialize_wandb function to initialize the Weights and Biases logger and set use_wandb=True when initializing the LaunchLogger.

    import torch

    import modulus
    from modulus.datapipes.benchmarks.darcy import Darcy2D
    from modulus.launch.logging import LaunchLogger, PythonLogger, initialize_wandb
    from modulus.metrics.general.mse import mse
    from modulus.models.fno.fno import FNO

    normaliser = {
        "permeability": (1.25, 0.75),
        "darcy": (4.52e-2, 2.79e-2),
    }
    dataloader = Darcy2D(
        resolution=256, batch_size=64, nr_permeability_freq=5, normaliser=normaliser
    )
    model = FNO(
        in_channels=1,
        out_channels=1,
        decoder_layers=1,
        decoder_layer_size=32,
        dimension=2,
        latent_channels=32,
        num_fno_layers=4,
        num_fno_modes=12,
        padding=5,
    ).to("cuda")

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: 0.85**step
    )

    # Initialize the console logger
    logger = PythonLogger("main")  # General python logger

    # Initialize the Weights and Biases logger
    initialize_wandb(
        project="Modulus Tutorials",
        name="Simple Modulus Tutorials",
        entity="Modulus MLFLow Tutorial",
        mode="offline",
    )
    LaunchLogger.initialize(use_wandb=True)

    # Use logger methods to track various information during training
    logger.info("Starting Training!")

    # Train for 20 epochs, each epoch running for 10 iterations
    for i in range(20):
        # Wrap the epoch in LaunchLogger to control the frequency of console output
        with LaunchLogger("train", epoch=i) as launchlog:
            # These would be iterations through different batches
            for _ in range(10):
                batch = next(iter(dataloader))
                true = batch["darcy"]
                pred = model(batch["permeability"])
                loss = mse(pred, true)
                # Clear gradients from the previous iteration so they do not accumulate
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                scheduler.step()
                launchlog.log_minibatch({"Loss": loss.detach().cpu().numpy()})

            launchlog.log_epoch({"Learning Rate": optimizer.param_groups[0]["lr"]})

    logger.info("Finished Training!")

During the run, a directory named wandb is created to store the wandb logs.

The logger output is shown below.

    Warp 0.10.1 initialized:
       CUDA Toolkit: 11.5, Driver: 12.2
       Devices:
         "cpu"    | x86_64
         "cuda:0" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:1" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:2" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:3" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:4" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:5" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:6" | Tesla V100-SXM2-16GB-N (sm_70)
         "cuda:7" | Tesla V100-SXM2-16GB-N (sm_70)
       Kernel cache: /root/.cache/warp/0.10.1
    /usr/local/lib/python3.10/dist-packages/pydantic/_internal/_fields.py:128: UserWarning: Field "model_server_url" has conflict with protected namespace "model_".

    You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
      warnings.warn(
    /usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:
    * 'schema_extra' has been renamed to 'json_schema_extra'
      warnings.warn(message, UserWarning)
    wandb: Tracking run with wandb version 0.15.12
    wandb: W&B syncing is set to `offline` in this directory.
    wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
    [21:26:38 - main - INFO] Starting Training!
    Module modulus.datapipes.benchmarks.kernels.initialization load on device 'cuda:0' took 74.11 ms
    Module modulus.datapipes.benchmarks.kernels.utils load on device 'cuda:0' took 310.06 ms
    Module modulus.datapipes.benchmarks.kernels.finite_difference load on device 'cuda:0' took 151.24 ms
    [21:26:48 - train - INFO] Epoch 0 Metrics: Learning Rate = 1.969e-03, Loss = 7.164e-01
    [21:26:48 - train - INFO] Epoch Execution Time: 9.703e+00s, Time/Iter: 9.703e+02ms
    ...
    [21:29:47 - train - INFO] Epoch 19 Metrics: Learning Rate = 7.652e-17, Loss = 3.519e-01
    [21:29:47 - train - INFO] Epoch Execution Time: 1.125e+01s, Time/Iter: 1.125e+03ms
    [21:29:47 - main - INFO] Finished Training!
    wandb: Waiting for W&B process to finish... (success).
    wandb:
    wandb: Run history:
    wandb: epoch ▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
    wandb: train/Epoch Time (s) ▃▁▃▃▃▃▁█▁▁▁▃▃▃▃▆▁▃▃▆
    wandb: train/Learning Rate █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
    wandb: train/Loss █▁▂▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
    wandb: train/Time per iter (ms) ▃▁▃▃▃▃▁█▁▁▁▃▃▃▃▆▁▃▃▆
    wandb:
    wandb: Run summary:
    wandb: epoch 19
    wandb: train/Epoch Time (s) 11.24806
    wandb: train/Learning Rate 0.0
    wandb: train/Loss 0.35193
    wandb: train/Time per iter (ms) 1124.80645
    wandb:
    wandb: You can sync this run to the cloud by running:
    wandb: wandb sync /workspace/modulus/docs/test_scripts/wandb/wandb/offline-run-20231115_212638-ib4ylq4e
    wandb: Find logs at: ./wandb/wandb/offline-run-20231115_212638-ib4ylq4e/logs

To visualize the logs interactively, simply follow the instructions printed in the output.
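
The learning rates reported in the console logs follow from the scheduler settings in the examples: the base rate is 0.01, the `LambdaLR` multiplier is `0.85**step`, and `scheduler.step()` is called once per iteration. The short check below reproduces that arithmetic in plain Python, without torch. The helper name `logged_lr` is hypothetical, and the sketch assumes the rate is read after the last iteration of each epoch, as in the training loops above.

```python
BASE_LR = 0.01  # lr passed to torch.optim.Adam in the examples
DECAY = 0.85    # base of the LambdaLR multiplier, 0.85**step


def logged_lr(epoch, iters_per_epoch):
    """Learning rate after every scheduler step up to the end of a 0-based epoch."""
    steps = iters_per_epoch * (epoch + 1)
    return BASE_LR * DECAY**steps


# 5 iterations per epoch (console and MLFlow runs), 10 per epoch (W&B run)
for iters in (5, 10):
    for epoch in (0, 1, 19):
        print(f"iters={iters:2d} epoch={epoch:2d} lr={logged_lr(epoch, iters):.3e}")
```

With 5 iterations per epoch this gives about 4.437e-03 after epoch 0 and 8.748e-10 after epoch 19, and with 10 iterations per epoch about 1.969e-03 and 7.652e-17, which agrees with the values in the logs. A decay this aggressive drives the learning rate to effectively zero within a few epochs, so a real training run would likely use a slower schedule.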