Migrate exp_manager to NeMoLogger and AutoResume#
In NeMo 2.0, the exp_manager configuration has been replaced with the NeMoLogger and AutoResume objects. This guide will help you migrate your experiment management setup.
NeMo 1.0 (Previous Release)#
In NeMo 1.0, experiment management was configured in the YAML configuration file.
exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: megatron_gpt
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  resume_from_checkpoint: ${model.resume_from_checkpoint}
  create_checkpoint_callback: True
  checkpoint_callback_params:
    dirpath: null # to use S3 checkpointing, set the dirpath in format s3://bucket/key
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
    async_save: False # Set to True to enable async checkpoint save. Currently works only with distributed checkpoints
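For reference, this YAML section was consumed in a NeMo 1.0 training script by the exp_manager helper, roughly as sketched below (here, cfg stands for the Hydra/OmegaConf config object that holds the YAML above):

from nemo.utils.exp_manager import exp_manager

# NeMo 1.0 pattern: exp_manager() reads the exp_manager section of the config
# and attaches logging, W&B, resume logic, and checkpoint callbacks to the trainer.
exp_manager(trainer, cfg.get("exp_manager", None))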
NeMo 2.0 (New Release)#
In NeMo 2.0, experiment management is configured using the NeMoLogger and AutoResume classes.
from nemo.collections import llm
from nemo import lightning as nl
from pytorch_lightning.loggers import WandbLogger

log = nl.NeMoLogger(
    name="megatron_gpt",
    log_dir=None,  # This will default to ./nemo_experiments
    explicit_log_dir=None,
    version=None,
    use_datetime_version=True,
    log_local_rank_0_only=False,
    log_global_rank_0_only=False,
    files_to_copy=None,
    update_logger_directory=True,
    wandb=WandbLogger(project=None, name=None),
    ckpt=nl.ModelCheckpoint(
        dirpath=None,  # to use S3 checkpointing, set the dirpath in format s3://bucket/key
        monitor="val_loss",
        save_top_k=10,
        mode="min",
        always_save_nemo=False,
        save_nemo_on_train_end=False,
        filename='megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}',
    ),
)

resume = nl.AutoResume(
    path=None,  # Equivalent to resume_from_checkpoint
    dirpath=None,
    import_path=None,
    resume_if_exists=True,
    resume_past_end=False,
    resume_ignore_no_checkpoint=True,
)

llm.train(..., log=log, resume=resume)
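If you are not logging to Weights & Biases (the NeMo 1.0 example above sets create_wandb_logger: False), the wandb argument can simply be omitted. A minimal sketch, assuming the remaining keyword arguments keep their defaults:

# Minimal logger setup without W&B; unspecified arguments fall back to their defaults.
log = nl.NeMoLogger(
    name="megatron_gpt",
    ckpt=nl.ModelCheckpoint(
        monitor="val_loss",
        save_top_k=10,
        mode="min",
    ),
)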
Additionally, the NeMo 1.0 experiment manager provided the option to add certain callbacks to the trainer. In NeMo 2.0, those callbacks are passed directly to your trainer. Notably, TimingCallback was used in NeMo 1.0 to log step times. To add TimingCallback in NeMo 2.0, add the callback directly to the trainer:
import nemo.lightning as nl
from nemo.utils.exp_manager import TimingCallback

trainer = nl.Trainer(
    ...
    callbacks=[TimingCallback()],
    ...
)
Migration Steps#
1. Remove the exp_manager section from your YAML config file.

2. Add the following imports to your Python script:

       from nemo import lightning as nl
       from pytorch_lightning.loggers import WandbLogger

3. Create a NeMoLogger object with the appropriate parameters:

       log = nl.NeMoLogger(
           name="megatron_gpt",
           log_dir=None,  # This will default to ./nemo_experiments
           explicit_log_dir=None,
           version=None,
           use_datetime_version=True,
           log_local_rank_0_only=False,
           log_global_rank_0_only=False,
           files_to_copy=None,
           update_logger_directory=True,
           wandb=WandbLogger(project=None, name=None),
           ckpt=nl.ModelCheckpoint(
               dirpath=None,
               monitor="val_loss",
               save_top_k=10,
               mode="min",
               always_save_nemo=False,
               save_nemo_on_train_end=False,
               filename='megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}',
               async_save=False,
           ),
       )

4. Create an AutoResume object with the appropriate parameters:

       resume = nl.AutoResume(
           path=None,  # Equivalent to resume_from_checkpoint
           dirpath=None,
           import_path=None,
           resume_if_exists=True,
           resume_past_end=False,
           resume_ignore_no_checkpoint=True,
       )

5. Add any callbacks you want to the trainer:

       import nemo.lightning as nl
       from nemo.lightning.pytorch.callbacks import PreemptionCallback
       from nemo.utils.exp_manager import TimingCallback

       callbacks = [TimingCallback(), PreemptionCallback()]

       trainer = nl.Trainer(
           ...
           callbacks=callbacks,
           ...
       )

6. Pass the trainer, log, and resume objects to the llm.train() function (a consolidated sketch follows this list):

       llm.train(..., trainer=trainer, log=log, resume=resume)

7. Adjust the parameters in NeMoLogger and AutoResume to match your previous YAML configuration.
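Putting the steps together, a minimal end-to-end sketch might look like the following. Here, model and data are placeholders for your own model and data module, and the trainer arguments are illustrative only:

from nemo.collections import llm
from nemo import lightning as nl
from nemo.utils.exp_manager import TimingCallback

trainer = nl.Trainer(
    accelerator="gpu",
    devices=1,
    max_steps=100,
    strategy=nl.MegatronStrategy(),  # typically used with Megatron-based models
    callbacks=[TimingCallback()],
)

log = nl.NeMoLogger(name="megatron_gpt")
resume = nl.AutoResume(resume_if_exists=True, resume_ignore_no_checkpoint=True)

# model and data are placeholders for your own model and data module
llm.train(model=model, data=data, trainer=trainer, log=log, resume=resume)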
Note
The model_parallel_size parameter is no longer needed in the checkpoint configuration.
For S3 checkpointing, set the dirpath in the ModelCheckpoint to the format s3://bucket/key, as sketched below.
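For example, an S3-backed ModelCheckpoint might be configured like this (the bucket and key below are hypothetical):

ckpt = nl.ModelCheckpoint(
    dirpath="s3://my-bucket/megatron_gpt/checkpoints",  # hypothetical bucket/key
    monitor="val_loss",
    save_top_k=10,
    mode="min",
)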