Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Migrate Checkpointing Configurations from NeMo 1.0 to NeMo 2.0
NeMo 1.0 (Previous Release)
In NeMo 1.0, distributed checkpointing was configured in the YAML configuration file.
# Distributed checkpoint setup
dist_ckpt_format: 'zarr' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
dist_ckpt_load_on_device: True # whether to load checkpoint weights directly on GPU or to CPU
dist_ckpt_parallel_save: False # if true, each worker will write its own part of the dist checkpoint
dist_ckpt_parallel_load: False # if true, each worker will load part of the dist checkpoint and exchange with NCCL. Might use some extra GPU memory
dist_ckpt_torch_dist_multiproc: 2 # number of extra processes per rank used during ckpt save with PyTorch distributed format
dist_ckpt_assume_constant_structure: False # set to True only if the state dict structure doesn't change within a single job. Allows caching some computation across checkpoint saves.
dist_ckpt_parallel_dist_opt: True # parallel save/load of a DistributedOptimizer. 'True' allows performant save and reshardable checkpoints. Set to 'False' only in order to minimize the number of checkpoint files.
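In a typical NeMo 1.0 Megatron YAML config, these keys usually sit under the model section. The placement sketch below reflects that assumption; the surrounding keys are illustrative, not part of the checkpointing setup:
model:
  # ... parallelism, optimizer, and data settings ...
  dist_ckpt_format: 'torch_dist'      # switch from the default 'zarr'
  dist_ckpt_parallel_save: true       # each worker writes its own shard
  dist_ckpt_torch_dist_multiproc: 2   # extra save processes per rank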
NeMo 2.0 (New Release)#
In NeMo 2.0, these settings are configured through the MegatronStrategy class. Each NeMo 1.0 key maps to a constructor argument, as noted in the comments below.
from nemo.collections import llm
from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    save_ckpt_format='zarr',                # dist_ckpt_format
    ckpt_load_on_device=True,               # dist_ckpt_load_on_device
    ckpt_parallel_save=False,               # dist_ckpt_parallel_save
    ckpt_parallel_load=False,               # dist_ckpt_parallel_load
    ckpt_torch_dist_multiproc=2,            # dist_ckpt_torch_dist_multiproc
    ckpt_assume_constant_structure=False,   # dist_ckpt_assume_constant_structure
    ckpt_parallel_save_optim=False,         # dist_ckpt_parallel_dist_opt
)

trainer = nl.Trainer(
    strategy=strategy,
    ...
)
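The same arguments apply when you use the PyTorch distributed checkpoint format mentioned in the NeMo 1.0 comments above. A minimal sketch; the values are illustrative, not recommended defaults:
# Sketch of a 'torch_dist' configuration with parallel save/load enabled.
strategy = nl.MegatronStrategy(
    save_ckpt_format='torch_dist',
    ckpt_torch_dist_multiproc=2,          # extra save processes per rank
    ckpt_parallel_save=True,              # each worker writes its own shard
    ckpt_parallel_load=True,              # workers load shards and exchange via NCCL
    ckpt_assume_constant_structure=True,  # only safe if the state dict structure is fixed within a job
)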
Migration Steps
1. Locate the distributed checkpoint setup section in your NeMo 1.0 YAML config file.
2. Pass the distributed checkpoint setup settings into MegatronStrategy:

strategy = nl.MegatronStrategy(
    save_ckpt_format='zarr',
    ckpt_load_on_device=True,
    ckpt_parallel_save=False,
    ckpt_parallel_load=False,
    ckpt_torch_dist_multiproc=2,
    ckpt_assume_constant_structure=False,
    ckpt_parallel_save_optim=False,
)
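Finally, pass the configured strategy to the trainer, as in the NeMo 2.0 example above. A minimal sketch; the accelerator and devices values are illustrative assumptions:
from nemo import lightning as nl

trainer = nl.Trainer(
    accelerator='gpu',   # illustrative; carry over your existing trainer settings
    devices=8,           # illustrative single-node, 8-GPU assumption
    strategy=strategy,   # the MegatronStrategy configured in step 2
)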
Note
Non-distributed checkpointing is not supported by NeMo 2.0.