Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Migrate Checkpointing Configurations from NeMo 1.0 to NeMo 2.0

NeMo 1.0 (Previous Release)

In NeMo 1.0, distributed checkpointing was configured in the YAML configuration file.

# Distributed checkpoint setup
dist_ckpt_format: 'zarr' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
dist_ckpt_load_on_device: True # whether to load checkpoint weights directly on GPU or to CPU
dist_ckpt_parallel_save: False # if true, each worker will write its own part of the dist checkpoint
dist_ckpt_parallel_load: False # if true, each worker will load part of the dist checkpoint and exchange with NCCL. Might use some extra GPU memory
dist_ckpt_torch_dist_multiproc: 2 # number of extra processes per rank used during ckpt save with PyTorch distributed format
dist_ckpt_assume_constant_structure: False # set to True only if the state dict structure doesn't change within a single job. Allows caching some computation across checkpoint saves.
dist_ckpt_parallel_dist_opt: True # parallel save/load of a DistributedOptimizer. 'True' allows performant save and reshardable checkpoints. Set to 'False' only in order to minimize the number of checkpoint files.

NeMo 2.0 (New Release)

In NeMo 2.0, these settings are configured through MegatronStrategy.

from nemo.collections import llm
from nemo import lightning as nl

# Each argument corresponds to a dist_ckpt_* key from the NeMo 1.0 YAML above.
strategy = nl.MegatronStrategy(
    save_ckpt_format='zarr',                # dist_ckpt_format
    ckpt_load_on_device=True,               # dist_ckpt_load_on_device
    ckpt_parallel_save=False,               # dist_ckpt_parallel_save
    ckpt_parallel_load=False,               # dist_ckpt_parallel_load
    ckpt_torch_dist_multiproc=2,            # dist_ckpt_torch_dist_multiproc
    ckpt_assume_constant_structure=False,   # dist_ckpt_assume_constant_structure
    ckpt_parallel_save_optim=False,         # dist_ckpt_parallel_dist_opt
)

nl.Trainer(
    strategy=strategy,
    ...
)
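
The NeMo 1.0 comments above also mention setting dist_ckpt_format to 'torch_dist' to use the PyTorch distributed checkpoint format, in which case dist_ckpt_torch_dist_multiproc controls the number of extra save processes per rank. Below is a minimal sketch of selecting that format in NeMo 2.0, reusing the nl import above; the remaining arguments stay the same as in the zarr example.

strategy = nl.MegatronStrategy(
    save_ckpt_format='torch_dist',  # PyTorch distributed checkpoint format instead of 'zarr'
    ckpt_torch_dist_multiproc=2,    # extra processes per rank used during checkpoint save
)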

Migration Steps

  1. Locate the distributed checkpoint setup section in your NeMo 1.0 YAML config file.

  2. Pass the distributed checkpoint setup settings into MegatronStrategy (a sketch for reading them from the old YAML programmatically follows these steps):

    strategy = nl.MegatronStrategy(
        save_ckpt_format='zarr',
        ckpt_load_on_device=True,
        ckpt_parallel_save=False,
        ckpt_parallel_load=False,
        ckpt_torch_dist_multiproc=2,
        ckpt_assume_constant_structure=False,
        ckpt_parallel_save_optim=False,
    )
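
To carry the values over programmatically instead of copying them by hand, you can read the old keys and forward them to MegatronStrategy. The sketch below is one way to do this; it assumes the dist_ckpt_* keys sit under the model section of your NeMo 1.0 config and uses an illustrative file name, so adjust both to your setup.

from omegaconf import OmegaConf

from nemo import lightning as nl

# Illustrative path; point this at your own NeMo 1.0 config file.
cfg = OmegaConf.load("megatron_gpt_config.yaml")
m = cfg.model  # assumes the dist_ckpt_* keys live under the `model` section

strategy = nl.MegatronStrategy(
    save_ckpt_format=m.dist_ckpt_format,
    ckpt_load_on_device=m.dist_ckpt_load_on_device,
    ckpt_parallel_save=m.dist_ckpt_parallel_save,
    ckpt_parallel_load=m.dist_ckpt_parallel_load,
    ckpt_torch_dist_multiproc=m.dist_ckpt_torch_dist_multiproc,
    ckpt_assume_constant_structure=m.dist_ckpt_assume_constant_structure,
    ckpt_parallel_save_optim=m.dist_ckpt_parallel_dist_opt,
)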
    

Note

Non-distributed checkpointing is not supported by NeMo 2.0.