Logging and Checkpointing#
Three main classes in NeMo 2.0 are responsible for configuring logging and checkpointing directories:
nemo.lightning.pytorch.callbacks.model_checkpoint.ModelCheckpoint
    Handles the logic that determines when to save a checkpoint. Provides the ability to perform asynchronous checkpointing.
nemo.lightning.nemo_logger.NeMoLogger
    Responsible for setting logging directories. Optionally configures the trainer's loggers.
nemo.lightning.resume.AutoResume
    Sets the checkpointing directory. Determines whether there is an existing checkpoint from which to resume.
ModelCheckpoint#
The ModelCheckpoint callback in NeMo 2.0 is a wrapper around PyTorch Lightning's ModelCheckpoint. It manages when checkpoints are saved and cleaned up during training. Additionally, it supports saving a checkpoint at the end of training and provides the support necessary for asynchronous checkpointing.
The following is an example of how to instantiate a ModelCheckpoint callback:
from nemo.lightning.pytorch.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
save_last=True,
monitor="val_loss",
save_top_k=2,
every_n_train_steps=30,
dirpath='my_model_directory',
always_save_context=True,
)
Refer to the documentation for the NeMo Lightning and PyTorch Lightning ModelCheckpoint classes for the complete list of supported arguments. Here, dirpath is the directory where checkpoints are saved. Note that dirpath is optional; if not provided, it defaults to log_dir / checkpoints, where log_dir is the path determined by the NeMoLogger, as described in the next section.
In addition, note that asynchronous checkpointing is enabled with the ckpt_async_save argument of MegatronStrategy. The checkpoint callback reads this attribute and performs asynchronous checkpointing when requested.
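For example, here is a minimal sketch of enabling asynchronous checkpointing; the parallelism settings are illustrative and should be configured for your model:

import nemo.lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=1,  # illustrative; set to match your model's parallelism
    ckpt_async_save=True,          # checkpoints are saved asynchronously
)
trainer = nl.Trainer(
    strategy=strategy,
    callbacks=[checkpoint_callback],  # the ModelCheckpoint instance from above
)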
Two options are available to pass the ModelCheckpoint callback instance to the trainer.

1. Add the callback to the set of callbacks and then pass the callbacks directly to the trainer:

import nemo.lightning as nl

callbacks = [checkpoint_callback]  ### add any other desired callbacks...
trainer = nl.Trainer(
    ...
    callbacks=callbacks,
    ...
)

2. Pass the callback to the NeMoLogger, as described in the NeMoLogger section below.
Checkpoint Directory Structure#
By default, ModelCheckpoint in NeMo saves checkpoints with the following structure:
log_dir
└-checkpoints/
  └-model_name=...step=...consumed_samples=.../
    ├-context/
    |  ├-io.json
    |  ├-model.yaml
    |  └-...
    └-weights/
       ├-common.pt
       ├-metadata.json
       ├-__0_0.distcp
       ├-__1_0.distcp
       └-...
The context directory contains the artifacts needed to reinitialize the experiment’s model, trainer, and dataloader. It is present only if one of the following conditions is met:
always_save_context is set to True when instantiating ModelCheckpoint, or
save_context_on_train_end is set to True and the checkpoint is the final checkpoint of the training run.
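For example, here is a sketch of a callback that skips the context directory for intermediate checkpoints but still writes it once at the end of training (argument values are illustrative):

checkpoint_callback = ModelCheckpoint(
    save_last=True,
    always_save_context=False,       # no context directory for intermediate checkpoints
    save_context_on_train_end=True,  # write the context with the final checkpoint
)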
The configuration of the model checkpoint is saved in io.json and displayed as a human-readable file in model.yaml. io.json is the source of truth for model configuration; modifying model.yaml has no effect when loading the model.
The weights directory consists primarily of .distcp files, which store the distributed checkpoint. By default, there are two .distcp files per rank.
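As a quick sanity check after a run, you can list a checkpoint's contents; here is a minimal sketch using only the standard library and assuming the dirpath from the earlier example:

from pathlib import Path

ckpt_root = Path("my_model_directory")  # dirpath from the ModelCheckpoint example
for ckpt in sorted(ckpt_root.iterdir()):
    print(ckpt.name)  # e.g. model_name=...step=...consumed_samples=...
    for f in sorted(ckpt.rglob("*")):
        print("   ", f.relative_to(ckpt))  # context/ and weights/ entries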
NeMoLogger#
The NeMoLogger class provides a standardized way to set up logging for NeMo experiments. It creates a new log directory (or reuses an existing one), manages experiment names and versions (optionally using timestamps), and can configure multiple loggers (e.g., TensorBoard and WandB). It also handles copying important files (like configurations) and manages checkpoint settings, ensuring all experiment artifacts are consistently organized.
Please refer to the NeMoLogger documentation for details on all supported arguments.
Here is an example of creating a new NeMoLogger instance:
from nemo.lightning import NeMoLogger
nemo_logger = NeMoLogger(
log_dir='my_logging_dir',
name='experiment1',
use_datetime_version=False,
)
By default, the directory where logs are written is log_dir / name / version. If an explicit version is not provided and use_datetime_version is False, the version segment is omitted and logs are written to log_dir / name.
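For example, the logger instantiated above (no explicit version, with use_datetime_version=False) writes its logs to:

my_logging_dir
└-experiment1/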
As mentioned earlier, you can optionally pass your ModelCheckpoint instance in here, and the logger will automatically configure the checkpoint callback in your trainer:
nemo_logger = NeMoLogger(
...
ckpt=checkpoint_callback,
...
)
Once your trainer has been initialized, the NeMoLogger can be set up using the following command:
nemo_logger.setup(
trainer,
resume_if_exists,
)
The resume_if_exists boolean indicates whether to resume from the latest checkpoint, if one is available. The value of resume_if_exists should match the value passed into AutoResume, as described below.
Experiment Logging#
NeMo 2.0 provides built-in support for logging experiments using popular tracking tools like TensorBoard and Weights & Biases (wandb).
TensorBoard Logging#
To use TensorBoard logging with NeMo 2.0:
First, ensure you have TensorBoard installed:
pip install tensorboard
Configure the TensorBoardLogger and add it to your NeMoLogger:
from lightning.pytorch.loggers import TensorBoardLogger

# Create TensorBoard logger
tensorboard = TensorBoardLogger(
    save_dir="tb_logs",  # Directory to store TensorBoard logs
    name="my_model",     # Name of the experiment
    version=None,        # Optional version number
)

# Add TensorBoard logger to NeMoLogger
nemo_logger = NeMoLogger(
    tensorboard=tensorboard,  # Pass TensorBoard logger here
    ...
)
In this example, the TensorBoard logs will be saved under the my_model experiment subdirectory of the tb_logs directory.
The update_logger_directory argument in NeMoLogger controls whether to update the directory of the TensorBoard logger to match the NeMo log dir. If set to True, the TensorBoard logger will also write to the same log directory.
Weights & Biases (wandb) Logging#
To use Weights & Biases (wandb) logging with NeMo 2.0:
First, ensure you have wandb installed:
pip install wandb
Configure the WandbLogger and add it to your NeMoLogger:
from lightning.pytorch.loggers import WandbLogger

# Create Wandb logger
wandb_logger = WandbLogger(
    project="my_project",  # Name of the W&B project
    name="my_experiment",  # Name of this specific run
    entity="my_team",      # Optional: username or team name
    config={},             # Optional: dictionary of hyperparameters
)

# Add Wandb logger to NeMoLogger
nemo_logger = NeMoLogger(
    wandb=wandb_logger,  # Pass Wandb logger here
    ...
)
The Weights & Biases logs will be automatically synced to your wandb account under the specified project and name. You can view your experiment metrics, system stats, and model artifacts through the wandb web interface.
Just as with the TensorBoard logger, the update_logger_directory argument in NeMoLogger controls whether to update the directory of the wandb logger to match the NeMo log dir. If set to True, the wandb logger will also write to the same log directory.
AutoResume#
The AutoResume class manages checkpoint paths and checks for existing checkpoints to restore from. Here’s an example of how it can be used:
from nemo.lightning import AutoResume
resume = AutoResume(
resume_if_exists=True,
resume_ignore_no_checkpoint=True,
resume_from_directory="checkpoint_dir_to_resume_from"
)
Here, resume_from_directory is the path of the checkpoint directory to resume from. If resume_from_directory is not provided, it defaults to log_dir / checkpoints, where log_dir is determined by the NeMoLogger instance, as described in the previous section.
The resume_ignore_no_checkpoint boolean determines whether to proceed without error if resume_if_exists is set to True and no checkpoint is found in the checkpointing directory.
Ensure that this value of resume_if_exists matches the one passed to the NeMoLogger setup, as described above.
AutoResume should be set up in a similar fashion to NeMoLogger.
resume.setup(trainer, model)
Passing a model into the setup is optional. It is only required when importing a checkpoint from Hugging Face or other non-NeMo checkpoint formats.
Putting it All Together#
To put it all together, configuring loggers and checkpointers in NeMo 2.0 looks like this:
from lightning.pytorch.loggers import TensorBoardLogger
from lightning.pytorch.loggers import WandbLogger

from nemo.lightning import AutoResume, NeMoLogger
from nemo.lightning.pytorch.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
save_last=True,
monitor="reduced_train_loss",
save_top_k=2,
every_n_train_steps=30,
dirpath='my_model_directory',
)
tensorboard = TensorBoardLogger(
save_dir="tb_logs",
name="experiment1",
)
wandb_logger = WandbLogger(
project="my_project",
name="my_experiment",
entity="my_team",
)
nemo_logger = NeMoLogger(
log_dir='my_logging_dir',
name='experiment1',
use_datetime_version=False,
ckpt=checkpoint_callback,
tensorboard=tensorboard,
wandb=wandb_logger,
update_logger_directory=True,
)
resume = AutoResume(
resume_if_exists=True,
resume_ignore_no_checkpoint=True,
)
### setup your trainer here ###
nemo_logger.setup(
trainer,
getattr(resume, "resume_if_exists", False),
)
resume.setup(trainer)
Note that using both TensorBoardLogger and WandbLogger at the same time is possible, as shown here, but uncommon. This example is mainly for demonstration purposes, so please adapt it to your needs.
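With logging, checkpointing, and resumption configured, training proceeds as usual; here is a minimal sketch assuming a model and data module defined elsewhere:

trainer.fit(model, datamodule=datamodule)  # model and datamodule are assumed to be defined elsewhere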