Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Logging and Checkpointing

Three main classes in NeMo 2.0 are responsible for configuring logging and checkpointing directories:

nemo.lightning.pytorch.callbacks.model_checkpoint.ModelCheckpoint

Handles the logic that determines when to save a checkpoint. In addition, this class provides the ability to perform asynchronous checkpointing.

nemo.lightning.nemo_logger.NeMoLogger

Responsible for setting logging directories and (optionally) configuring the trainer’s loggers.

nemo.lightning.resume.AutoResume

Sets the checkpointing directory and determines whether there is an existing checkpoint from which to resume.

ModelCheckpoint

The ModelCheckpoint callback in NeMo 2.0 is a wrapper around PyTorch Lightning’s ModelCheckpoint. It manages when to save and clean up checkpoints during training. Additionally, it supports saving a checkpoint at the end of training and provides the necessary support for asynchronous checkpointing.

The following is an example of how to instantiate a ModelCheckpoint callback:

from nemo.lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    save_best_model=True,
    save_last=True,
    monitor="val_loss",
    save_top_k=2,
    every_n_train_steps=30,
    enable_nemo_ckpt_io=False,
    dirpath='my_model_directory',
)

Refer to the documentation for NeMo Lightning and PyTorch Lightning’s ModelCheckpoint classes to find the complete list of supported arguments. Here, dirpath refers to the directory to save the checkpoints. Note that dirpath is optional. If not provided, it will default to log_dir / checkpoints, where log_dir is the path determined by the NeMoLogger, as described in detail in the subsequent section.
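The dirpath fallback described above can be sketched as follows. This is a hypothetical helper for illustration only, not NeMo’s actual implementation:

```python
from pathlib import Path
from typing import Optional

def resolve_checkpoint_dir(log_dir: str, dirpath: Optional[str] = None) -> Path:
    """Sketch of the default: an explicit dirpath wins; otherwise
    checkpoints go under log_dir / 'checkpoints'."""
    if dirpath is not None:
        return Path(dirpath)
    return Path(log_dir) / "checkpoints"

# With no dirpath, checkpoints land under the NeMoLogger log_dir.
print(resolve_checkpoint_dir("my_logging_dir/experiment1"))
# With an explicit dirpath, that directory is used directly.
print(resolve_checkpoint_dir("my_logging_dir/experiment1", "my_model_directory"))
```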

In addition, note that asynchronous checkpointing is enabled via the ckpt_async_save argument of MegatronStrategy. The checkpoint callback then reads this attribute to perform asynchronous checkpointing as requested.
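For reference, enabling asynchronous checkpointing on the strategy might look like the following sketch (a configuration fragment; any arguments beyond ckpt_async_save are elided):

```python
import nemo.lightning as nl

# ckpt_async_save lives on the strategy, not the callback; the
# ModelCheckpoint callback reads this attribute when saving.
strategy = nl.MegatronStrategy(
    ckpt_async_save=True,
    # ... parallelism and other strategy settings ...
)
```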

Two options are available to pass the ModelCheckpoint callback instance to the trainer.

  1. Add the callback to the set of callbacks and then pass the callbacks directly to the trainer:

    import nemo.lightning as nl
    
    callbacks = [checkpoint_callback]
    ### add any other desired callbacks...
    
    trainer = nl.Trainer(
        ...
        callbacks = callbacks,
        ...
    )
    
  2. Pass the callback to the NeMoLogger, as described below.

NeMoLogger

The NeMoLogger class is responsible for setting the logging directories for NeMo runs. The logger supports a variety of arguments; refer to the NeMoLogger documentation for a detailed description.

Here is an example of creating a new NeMoLogger instance:

from nemo.lightning import NeMoLogger
from pytorch_lightning.loggers import TensorBoardLogger

nemo_logger = NeMoLogger(
    dir='my_logging_dir',
    name='experiment1',
    use_datetime_version=False,
    tensorboard=TensorBoardLogger(
        save_dir='local_tb_path',
    ),
    update_logger_directory=True,
)

By default, the log_dir where logs are written is dir / name / version. If no explicit version is provided and use_datetime_version is False, the version component is omitted and logs are written to dir / name.
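The directory resolution described above can be summarized with a small sketch. The helper below is hypothetical (not NeMoLogger’s actual code) and assumes the datetime version is a simple timestamp string:

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

def resolve_log_dir(dir: str, name: str,
                    version: Optional[str] = None,
                    use_datetime_version: bool = True) -> Path:
    """Sketch: log_dir is dir / name / version. With no explicit version,
    fall back to a datetime string if use_datetime_version is True,
    otherwise drop the version component entirely."""
    if version is None and use_datetime_version:
        version = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    base = Path(dir) / name
    return base / version if version is not None else base

# No version, use_datetime_version=False -> dir / name
print(resolve_log_dir("my_logging_dir", "experiment1",
                      use_datetime_version=False))
```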

TensorBoard and WandB loggers can also be configured using NeMoLogger. To set up a PTL logger, simply initialize the logger and pass it to the tensorboard or wandb argument of the NeMoLogger constructor. The update_logger_directory argument controls whether to update the directory of the PTL loggers to match the NeMo log dir. If set to True, the PTL logger will also write to the same log directory.

As mentioned earlier, you can optionally pass your ModelCheckpoint instance in here, and the logger will automatically configure the checkpoint callback in your trainer:

nemo_logger = NeMoLogger(
    ...
    ckpt=checkpoint_callback,
    ...
)

Once your trainer has been initialized, the NeMoLogger can be set up with the following call:

nemo_logger.setup(
    trainer,
    resume_if_exists,
)

The resume_if_exists boolean indicates whether to resume from the latest checkpoint, if one is available. The value of resume_if_exists should match the value passed into AutoResume as described below.

AutoResume

The AutoResume class manages checkpoint paths and checks for existing checkpoints to restore from. Here’s an example of how it can be used:

from nemo.lightning import AutoResume

resume = AutoResume(
    resume_if_exists=True,
    resume_ignore_no_checkpoint=True,
    dirpath="checkpoint_dir_to_resume_from"
)

In the script, dirpath refers to the path of the checkpoint directory to resume from. If no dirpath is provided, the directory to resume from will default to log_dir / checkpoints, where log_dir is determined by the NeMoLogger instance as described in the previous section.

The resume_ignore_no_checkpoint boolean determines whether to proceed without error in the case that resume_if_exists is set to True and no checkpoint is found in the checkpointing directory.
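The interaction between the two flags can be captured in a small decision sketch. The helper name and checkpoint-lookup logic below are hypothetical, used only to illustrate the behavior described above:

```python
from pathlib import Path
from typing import Optional

def find_resume_path(ckpt_dir: str,
                     resume_if_exists: bool,
                     resume_ignore_no_checkpoint: bool) -> Optional[Path]:
    """Sketch of the resume decision: return a checkpoint to restore from,
    None to start fresh, or raise when a required checkpoint is missing."""
    if not resume_if_exists:
        return None
    ckpts = sorted(Path(ckpt_dir).glob("*.ckpt")) if Path(ckpt_dir).is_dir() else []
    if ckpts:
        return ckpts[-1]  # e.g. pick the latest checkpoint
    if resume_ignore_no_checkpoint:
        return None       # no checkpoint found, proceed without error
    raise FileNotFoundError(f"No checkpoint found in {ckpt_dir}")
```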

Ensure that the value of resume_if_exists matches the argument passed into the NeMoLogger instance.

AutoResume should be set up in a similar fashion to NeMoLogger.

resume.setup(trainer, model)

Passing a model into the setup is optional. It is only required when importing a checkpoint from Hugging Face or other non-NeMo checkpoint formats.

Putting it All Together

To put it all together, configuring loggers and checkpointers in NeMo 2.0 looks like this:

checkpoint_callback = ModelCheckpoint(
    save_best_model=True,
    save_last=True,
    monitor="reduced_train_loss",
    save_top_k=2,
    every_n_train_steps=30,
    enable_nemo_ckpt_io=False,
    dirpath='my_model_directory',
)

nemo_logger = NeMoLogger(
    dir='my_logging_dir',
    name='experiment1',
    use_datetime_version=False,
    update_logger_directory=True,
    ckpt=checkpoint_callback,
)

resume = AutoResume(
    resume_if_exists=True,
    resume_ignore_no_checkpoint=True,
)

### setup your trainer here ###

nemo_logger.setup(
    trainer,
    getattr(resume, "resume_if_exists", False),
)
resume.setup(trainer)