Torch Distributed Checkpoint (TDC)

Overview

Torch Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. You can use this feature to save a checkpoint on any number of ranks in parallel.

Torch Distributed Checkpoint also allows you to change tensor_model_parallel_size and pipeline_model_parallel_size for the same checkpoint, even within the same training session.
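
For example (the parallelism values below are purely illustrative), a checkpoint saved by a run configured with:

    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 2

can later be loaded by a run configured with:

    tensor_model_parallel_size: 2
    pipeline_model_parallel_size: 1

without any offline checkpoint conversion.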

NeMo Framework supports TDC for GPT-based models such as GPT-3 and Llama.

Usage

Model Training

Set up the TDC parameter in the model configuration:

dist_ckpt_format: 'torch_dist' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.

Please note that TDC is only supported with mcore_gpt=True.
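
A minimal sketch of the relevant part of the model configuration is shown below. The nesting under model: follows the usual NeMo YAML layout, and the parallelism values are illustrative only:

    model:
      mcore_gpt: True                   # TDC requires the Megatron Core GPT implementation
      dist_ckpt_format: 'torch_dist'    # save/load checkpoints with Torch Distributed Checkpoint
      tensor_model_parallel_size: 4     # illustrative; may differ when the checkpoint is loaded again
      pipeline_model_parallel_size: 2   # illustrative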

Model Fine-Tuning

Set the TDC parameter in the model configuration:

dist_ckpt_format: 'torch_dist' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.

Please note that TDC is only supported with mcore_gpt=True.
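
A minimal fine-tuning sketch is shown below, assuming a typical NeMo fine-tuning configuration. The restore_from_path key and the path shown are placeholders for your own setup; check your fine-tuning config for the exact key used to point at the pretrained checkpoint:

    model:
      mcore_gpt: True                           # TDC requires the Megatron Core GPT implementation
      dist_ckpt_format: 'torch_dist'            # load and save checkpoints in the distributed format
      restore_from_path: /path/to/checkpoint    # placeholder path to a checkpoint saved with TDC
      tensor_model_parallel_size: 2             # may differ from the value used when the checkpoint was saved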