Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Migrate Checkpointing Configurations from NeMo 1.0 to NeMo 2.0#

NeMo 1.0 (Previous Release)#

In NeMo 1.0, distributed checkpointing was configured in the YAML configuration file.

# Distributed checkpoint setup
dist_ckpt_format: 'zarr' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
dist_ckpt_load_on_device: True # whether to load checkpoint weights directly onto the GPU or onto the CPU first
dist_ckpt_parallel_save: False # if true, each worker will write its own part of the dist checkpoint
dist_ckpt_parallel_load: False # if true, each worker will load part of the dist checkpoint and exchange with NCCL. Might use some extra GPU memory
dist_ckpt_torch_dist_multiproc: 2 # number of extra processes per rank used during ckpt save with PyTorch distributed format
dist_ckpt_assume_constant_structure: False # set to True only if the state dict structure doesn't change within a single job. Allows caching some computation across checkpoint saves.
dist_ckpt_parallel_dist_opt: True # parallel save/load of a DistributedOptimizer. 'True' allows performant saves and reshardable checkpoints. Set to 'False' only to minimize the number of checkpoint files.
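
These keys sit alongside the rest of the training configuration and are read like any other NeMo 1.0 setting. A minimal sketch of inspecting them with OmegaConf, assuming the file name and the model-section layout of the example Megatron GPT configs (adjust both to your own config):

from omegaconf import OmegaConf

# Load the NeMo 1.0 training config. The file name and the 'model' section are
# assumptions based on the example Megatron GPT configs.
cfg = OmegaConf.load("megatron_gpt_config.yaml")

print(cfg.model.dist_ckpt_format)          # e.g. 'zarr'
print(cfg.model.dist_ckpt_parallel_save)   # e.g. False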

NeMo 2.0 (New Release)#

In NeMo 2.0, these settings are configured through the MegatronStrategy.

from nemo.collections import llm
from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    save_ckpt_format='zarr',                  # dist_ckpt_format
    ckpt_load_on_device=True,                 # dist_ckpt_load_on_device
    ckpt_parallel_save=False,                 # dist_ckpt_parallel_save
    ckpt_parallel_load=False,                 # dist_ckpt_parallel_load
    ckpt_torch_dist_multiproc=2,              # dist_ckpt_torch_dist_multiproc
    ckpt_assume_constant_structure=False,     # dist_ckpt_assume_constant_structure
    ckpt_parallel_save_optim=True,            # dist_ckpt_parallel_dist_opt
)

nl.Trainer(
    strategy=strategy,
    ...
)
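
The example above keeps the 'zarr' format from the NeMo 1.0 config. If your config sets dist_ckpt_format: 'torch_dist' instead, the same mapping applies; the sketch below is illustrative, carrying over the remaining values from the YAML above (the multiproc setting only takes effect with the PyTorch distributed format):

from nemo import lightning as nl

# Minimal sketch for a NeMo 1.0 config that used dist_ckpt_format: 'torch_dist'.
strategy = nl.MegatronStrategy(
    save_ckpt_format='torch_dist',
    ckpt_torch_dist_multiproc=2,   # extra save processes per rank; relevant only to torch_dist
    ckpt_load_on_device=True,
    ckpt_parallel_save=False,
    ckpt_parallel_load=False,
    ckpt_assume_constant_structure=False,
    ckpt_parallel_save_optim=True,
)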

Migration Steps#

  1. Locate the distributed checkpoint setup section in your NeMo 1.0 YAML config file.

  2. Pass the distributed checkpoint settings into MegatronStrategy (a complete sketch combining the strategy and the trainer follows these steps):

    strategy = nl.MegatronStrategy(
        save_ckpt_format='zarr',
        ckpt_load_on_device=True,
        ckpt_parallel_save=False,
        ckpt_parallel_load=False,
        ckpt_torch_dist_multiproc=2,
        ckpt_assume_constant_structure=False,
        ckpt_parallel_save_optim=True,
    )
    

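Putting the two steps together, a minimal end-to-end sketch; the trainer arguments (accelerator, devices, num_nodes) are illustrative and should mirror your NeMo 1.0 trainer settings:

from nemo import lightning as nl

# Strategy carrying the migrated distributed checkpoint settings from the YAML above.
strategy = nl.MegatronStrategy(
    save_ckpt_format='zarr',
    ckpt_load_on_device=True,
    ckpt_parallel_save=False,
    ckpt_parallel_load=False,
    ckpt_torch_dist_multiproc=2,
    ckpt_assume_constant_structure=False,
    ckpt_parallel_save_optim=True,
)

# Illustrative trainer settings; replace with the values from your NeMo 1.0 config.
trainer = nl.Trainer(
    accelerator='gpu',
    devices=8,
    num_nodes=1,
    strategy=strategy,
)
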
Note

Non-distributed checkpointing is not supported by NeMo 2.0.