Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Torch Distributed Checkpoint (TDC)
Overview
Torch Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It can be used to save checkpoints from any number of ranks in parallel.
Torch Distributed Checkpoint allows you to change tensor_model_parallel_size and pipeline_model_parallel_size for the same checkpoint, even during the training session.
NeMo Framework supports TDC for GPT-based models such as GPT-3 and Llama.
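As a minimal sketch of what this looks like in practice (assuming a typical NeMo-style YAML model configuration; the parallelism values below are hypothetical), a checkpoint saved with one layout can later be loaded with another:
---
# First run: save a torch_dist checkpoint with TP=4, PP=1 (hypothetical values)
model:
  mcore_gpt: True
  dist_ckpt_format: 'torch_dist'
  tensor_model_parallel_size: 4
  pipeline_model_parallel_size: 1
---
# Resumed run: the same checkpoint is loaded with TP=2, PP=2, with no offline conversion step
model:
  mcore_gpt: True
  dist_ckpt_format: 'torch_dist'
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 2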
Usage
Model Training
Set the TDC parameter in the model configuration:
dist_ckpt_format: 'torch_dist' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
Please note that TDC is only supported with mcore_gpt=True.
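For context, here is a minimal sketch of where this setting might sit in a GPT training configuration; the surrounding keys are typical Megatron GPT options and the values shown are illustrative, not required:
model:
  mcore_gpt: True                  # TDC is only supported on the Megatron Core GPT path
  dist_ckpt_format: 'torch_dist'   # save/load checkpoints with the PyTorch distributed checkpoint format
  tensor_model_parallel_size: 2    # illustrative values; any parallel layout can be used
  pipeline_model_parallel_size: 1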
Model Fine-Tuning
Define the configuration for TDC in the model configuration:
dist_ckpt_format: 'torch_dist' # Set to 'torch_dist' to use PyTorch distributed checkpoint format.
Please note that TDC is only supported with mcore_gpt=True.
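A similar sketch applies to fine-tuning, assuming a config with a restore_from_path key (the path and parallelism values below are placeholders). Because the checkpoint is stored in torch_dist format, the fine-tuning run may request a different parallel layout than the one used for pre-training:
model:
  mcore_gpt: True
  dist_ckpt_format: 'torch_dist'
  restore_from_path: /path/to/pretrained.nemo   # placeholder path to the pre-trained checkpoint
  tensor_model_parallel_size: 2                 # may differ from the pre-training layout
  pipeline_model_parallel_size: 2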