Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Torch Distributed Checkpoint (TDC)
Overview
Torch Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It can be used to save checkpoints from any number of ranks in parallel.
Torch Distributed Checkpoint allows you to change tensor_model_parallel_size and pipeline_model_parallel_size for the same checkpoint, even during the training session.
NeMo Framework supports TDC for GPT-based models such as GPT-3 and Llama.
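As a minimal sketch of what this looks like in practice (assuming a typical NeMo-style YAML model configuration; the parallelism values below are hypothetical), a checkpoint saved with one layout can later be loaded with another:
---
# First run: save a torch_dist checkpoint with TP=4, PP=1 (hypothetical values)
model:
  mcore_gpt: True
  dist_ckpt_format: 'torch_dist'
  tensor_model_parallel_size: 4
  pipeline_model_parallel_size: 1
---
# Resumed run: the same checkpoint is loaded with TP=2, PP=2, with no offline conversion step
model:
  mcore_gpt: True
  dist_ckpt_format: 'torch_dist'
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 2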
Usage
Model Training
Set the TDC parameter in the model configuration:
dist_ckpt_format: 'torch_dist' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
Please note that TDC is only supported with mcore_gpt=True.
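For context, here is a minimal sketch of where this setting might sit in a GPT training configuration; the surrounding keys are typical Megatron GPT options and the values shown are illustrative, not required:
model:
  mcore_gpt: True                  # TDC is only supported on the Megatron Core GPT path
  dist_ckpt_format: 'torch_dist'   # save/load checkpoints with the PyTorch distributed checkpoint format
  tensor_model_parallel_size: 2    # illustrative values; any parallel layout can be used
  pipeline_model_parallel_size: 1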
Model Fine-Tuning
Define the configuration for TDC in the model configuration:
dist_ckpt_format: 'torch_dist' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
Please note that TDC is only supported with mcore_gpt=True.
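A similar sketch applies to fine-tuning, assuming a config with a restore_from_path key (the path and parallelism values below are placeholders). Because the checkpoint is stored in torch_dist format, the fine-tuning run may request a different parallel layout than the one used for pre-training:
model:
  mcore_gpt: True
  dist_ckpt_format: 'torch_dist'
  restore_from_path: /path/to/pretrained.nemo   # placeholder path to the pre-trained checkpoint
  tensor_model_parallel_size: 2                 # may differ from the pre-training layout
  pipeline_model_parallel_size: 2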