Logging and Checkpointing#
Three main classes in NeMo 2.0 are responsible for configuring logging and checkpointing directories:
nemo.lightning.pytorch.callbacks.model_checkpoint.ModelCheckpoint
Handles the logic that determines when to save a checkpoint. In addition, this class provides the ability to perform asynchronous checkpointing.
nemo.lightning.nemo_logger.NeMoLogger
Responsible for setting logging directories and (optionally) configuring the trainer’s loggers.
nemo.lightning.resume.AutoResume
Sets the checkpointing directory and determines whether there is an existing checkpoint from which to resume.
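For reference, the examples below import these classes using the shorter paths re-exported by NeMo (the fully qualified locations are listed above):
from nemo.lightning.pytorch.callbacks import ModelCheckpoint
from nemo.lightning import NeMoLogger, AutoResume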
ModelCheckpoint#
The ModelCheckpoint callback in NeMo 2.0 is a wrapper around PyTorch Lightning's ModelCheckpoint. It manages when to save and clean up checkpoints during training. Additionally, it supports saving a checkpoint at the end of training and provides the necessary support for asynchronous checkpointing.
The following is an example of how to instantiate a ModelCheckpoint callback:
from nemo.lightning.pytorch.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
save_last=True,
monitor="val_loss",
save_top_k=2,
every_n_train_steps=30,
dirpath='my_model_directory',
)
Refer to the documentation for the NeMo Lightning and PyTorch Lightning ModelCheckpoint classes for the complete list of supported arguments. Here, dirpath is the directory in which to save checkpoints. Note that dirpath is optional. If not provided, it defaults to log_dir / checkpoints, where log_dir is the path determined by the NeMoLogger, as described in detail in the subsequent section.
In addition, note that asynchronous checkpointing is enabled through the ckpt_async_save argument of MegatronStrategy. The checkpoint callback reads this attribute and performs asynchronous checkpointing as requested, as sketched below.
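A minimal sketch of enabling it (other MegatronStrategy arguments are omitted and should be set to match your model and cluster):
import nemo.lightning as nl

# Enable asynchronous checkpoint saves; the ModelCheckpoint callback picks this up.
strategy = nl.MegatronStrategy(ckpt_async_save=True)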
Two options are available to pass the ModelCheckpoint callback instance to the trainer.

1. Add the callback to the list of callbacks and pass the callbacks directly to the trainer:

import nemo.lightning as nl

callbacks = [checkpoint_callback] ### add any other desired callbacks...

trainer = nl.Trainer(
    ...
    callbacks=callbacks,
    ...
)

2. Pass the callback to the NeMoLogger, as described below.
NeMoLogger#
The NeMoLogger class provides a standardized way to set up logging for NeMo experiments. It creates a new log directory (or reuses an existing one), manages experiment names and versions (optionally using timestamps), and can configure multiple loggers (e.g., TensorBoard and WandB). It also handles copying important files (like configurations) and manages checkpoint settings, ensuring all experiment artifacts are consistently organized.
Please refer to the NeMoLogger documentation for details on all supported arguments.
Here is an example of creating a new NeMoLogger instance:
from nemo.lightning import NeMoLogger
nemo_logger = NeMoLogger(
log_dir='my_logging_dir',
name='experiment1',
use_datetime_version=False,
)
By default, the directory where logs are written is log_dir / name / version. If an explicit version is not provided and use_datetime_version is False, the directory becomes log_dir / name.
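As a concrete illustration for the example above, here is a sketch of the resulting paths based on the rules just described (not output captured from a run):
from pathlib import Path

# No explicit version and use_datetime_version=False, so the version level is dropped.
experiment_dir = Path("my_logging_dir") / "experiment1"    # log_dir / name
default_ckpt_dir = experiment_dir / "checkpoints"          # default ModelCheckpoint dirpath when none is given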
As mentioned earlier, you can optionally pass your ModelCheckpoint instance in here, and the logger will automatically configure the checkpoint callback in your trainer:
nemo_logger = NeMoLogger(
...
ckpt=checkpoint_callback,
...
)
Once your trainer has been initialized, the NeMoLogger can be set up using the following command:
nemo_logger.setup(
trainer,
resume_if_exists,
)
The resume_if_exists boolean indicates whether to resume from the latest checkpoint, if one is available. The value of resume_if_exists should match the value passed into AutoResume, as described below.
Experiment Logging#
NeMo 2.0 provides built-in support for logging experiments using popular tracking tools like TensorBoard and Weights & Biases (wandb).
TensorBoard Logging#
To use TensorBoard logging with NeMo 2.0:
First, ensure you have TensorBoard installed:
pip install tensorboard
Configure the TensorBoardLogger and add it to your NeMoLogger:
from lightning.pytorch.loggers import TensorBoardLogger

# Create TensorBoard logger
tensorboard = TensorBoardLogger(
    save_dir="tb_logs",  # Directory to store TensorBoard logs
    name="my_model",     # Name of the experiment
    version=None,        # Optional version number
)

# Add TensorBoard logger to NeMoLogger
nemo_logger = NeMoLogger(
    tensorboard=tensorboard,  # Pass TensorBoard logger here
    ...
)
In this example, the TensorBoard logs will be saved under the tb_logs directory, in a subdirectory for the my_model experiment.
The update_logger_directory argument in NeMoLogger controls whether to update the directory of the TensorBoard logger to match the NeMo log dir. If set to True, the TensorBoard logger will also write to the same log directory.
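A minimal sketch of that option, reusing the tensorboard logger defined above (the remaining NeMoLogger arguments mirror the earlier example):
nemo_logger = NeMoLogger(
    log_dir="my_logging_dir",
    name="experiment1",
    tensorboard=tensorboard,
    update_logger_directory=True,  # point the TensorBoard logger at the NeMo log dir
)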
Weights & Biases (wandb) Logging#
To use Weights & Biases (wandb) logging with NeMo 2.0:
First, ensure you have wandb installed:
pip install wandb
Configure the WandbLogger and add it to your NeMoLogger:
from lightning.pytorch.loggers import WandbLogger

# Create Wandb logger
wandb_logger = WandbLogger(
    project="my_project",  # Name of the W&B project
    name="my_experiment",  # Name of this specific run
    entity="my_team",      # Optional: username or team name
    config={},             # Optional: dictionary of hyperparameters
)

# Add Wandb logger to NeMoLogger
nemo_logger = NeMoLogger(
    wandb=wandb_logger,  # Pass Wandb logger here
    ...
)
The Weights & Biases logs will be automatically synced to your wandb account under the specified project and name. You can view your experiment metrics, system stats, and model artifacts through the wandb web interface.
Just as with the TensorBoard logger, the update_logger_directory argument in NeMoLogger controls whether to update the directory of the wandb logger to match the NeMo log dir. If set to True, the wandb logger will also write to the same log directory.
AutoResume#
The AutoResume class manages checkpoint paths and checks for existing checkpoints to restore from. Here's an example of how it can be used:
from nemo.lightning import AutoResume
resume = AutoResume(
resume_if_exists=True,
resume_ignore_no_checkpoint=True,
resume_from_directory="checkpoint_dir_to_resume_from"
)
In the script, resume_from_directory refers to the path of the checkpoint directory to resume from. If no resume_from_directory is provided, the directory to resume from defaults to log_dir / checkpoints, where log_dir is determined by the NeMoLogger instance as described in the previous section.
The resume_ignore_no_checkpoint boolean determines whether to proceed without error in the case that resume_if_exists is set to True and no checkpoint is found in the checkpointing directory.
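For contrast, a sketch of the stricter configuration (same arguments as the example above, with resume_ignore_no_checkpoint flipped):
# With resume_if_exists=True and resume_ignore_no_checkpoint=False, AutoResume is
# expected to raise an error when no checkpoint exists in the resume directory,
# instead of silently starting training from scratch.
strict_resume = AutoResume(
    resume_if_exists=True,
    resume_ignore_no_checkpoint=False,
    resume_from_directory="checkpoint_dir_to_resume_from",
)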
Ensure that the value of resume_if_exists matches the value passed to the NeMoLogger setup shown earlier.
AutoResume should be set up in a similar fashion to NeMoLogger:
resume.setup(trainer, model)
Passing a model into the setup is optional. It is only required when importing a checkpoint from Hugging Face or other non-NeMo checkpoint formats.
Putting it All Together#
To put it all together, configuring loggers and checkpointers in NeMo 2.0 looks like this:
from lightning.pytorch.loggers import TensorBoardLogger
from lightning.pytorch.loggers import WandbLogger

from nemo.lightning import AutoResume, NeMoLogger
from nemo.lightning.pytorch.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
save_last=True,
monitor="reduced_train_loss",
save_top_k=2,
every_n_train_steps=30,
dirpath='my_model_directory',
)
tensorboard = TensorBoardLogger(
save_dir="tb_logs",
name="experiment1",
)
wandb_logger = WandbLogger(
project="my_project",
name="my_experiment",
entity="my_team",
)
nemo_logger = NeMoLogger(
log_dir='my_logging_dir',
name='experiment1',
use_datetime_version=False,
ckpt=checkpoint_callback,
tensorboard=tensorboard,
wandb=wandb_logger,
update_logger_directory=True,
)
resume = AutoResume(
resume_if_exists=True,
resume_ignore_no_checkpoint=True,
)
### setup your trainer here ###
nemo_logger.setup(
trainer,
getattr(resume, "resume_if_exists", False),
)
resume.setup(trainer)
Note that using both TensorBoardLogger and WandbLogger at the same time is possible, as shown here, but uncommon. This example is mainly for demonstration purposes, so please adapt it to your needs.