Checkpointing in NeMo AutoModel
Introduction
During machine-learning experiments, the model-training routine regularly saves checkpoints. A checkpoint is a complete snapshot of a run that includes model weights, optimizer states, and other metadata required to resume training exactly where it left off. Writing these snapshots at regular intervals lets you recover quickly from crashes or pauses without losing progress.
NeMo AutoModel checkpoints capture the complete state of a distributed training run across multiple GPUs or nodes. This reduces memory overhead, improves GPU utilization, and allows training to be resumed with a different parallelism strategy.
NeMo AutoModel writes checkpoints in two formats: Hugging Face Safetensors and PyTorch Distributed Checkpointing (DCP). It also supports two layouts:
- Consolidated Checkpoints: The complete model state is saved as a Hugging Face-compatible bundle, typically in a single file or a compact set of files with an index. Because tensors are not split across GPUs (unsharded), tools like Hugging Face, vLLM, and SGLang can load these checkpoints directly.
- Sharded Checkpoints: During distributed training with parameter sharding, each GPU holds a subset (or "shard") of the full state, such as model weights and optimizer states. When checkpointing, each GPU writes its own shard independently without reconstructing the full model state.
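The difference between the two layouts can be illustrated with a toy sketch. It uses plain Python lists in place of tensors, and the function names shard_state and consolidate are hypothetical; this is not how NeMo AutoModel or PyTorch DCP split tensors internally.

```python
def shard_state(state, world_size):
    """Split every flat parameter into contiguous per-rank slices (toy sharding)."""
    shards = [dict() for _ in range(world_size)]
    for name, values in state.items():
        chunk = -(-len(values) // world_size)  # ceiling division
        for rank in range(world_size):
            shards[rank][name] = values[rank * chunk:(rank + 1) * chunk]
    return shards


def consolidate(shards):
    """Rebuild the full (unsharded) state by concatenating shards in rank order."""
    full = {}
    for shard in shards:
        for name, values in shard.items():
            full.setdefault(name, []).extend(values)
    return full


state = {"weight": [1, 2, 3, 4, 5], "bias": [0.1, 0.2]}
shards = shard_state(state, world_size=2)   # each rank would write only its own dict
restored = consolidate(shards)              # what a consolidated export reassembles
```

In the sharded layout each rank only ever touches its own dictionary; the consolidated layout corresponds to the reassembled result.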
We provide an overview of the different types of available checkpoint formats in the table below.
You can switch between output formats through the recipe's YAML configuration file:
Note:
save_consolidated accepts:
- final (recommended): keep intermediate checkpoints sharded and export consolidated HF weights only for the final checkpoint.
- false: save sharded checkpoints only. Run the generated model/consolidate.sh helper later if you need HF weights.
- every (or legacy true): export consolidated HF weights during every checkpoint save. Use this only when every checkpoint must be immediately loadable by HF tools.
AutoModel writes a model/consolidate.sh helper next to safetensors model shards. Use this helper to create a Hugging Face-compatible model/consolidated/ directory after training for save_consolidated: false checkpoints, or for earlier checkpoints when using save_consolidated: final. Creating consolidated Hugging Face weights requires model_save_format: safetensors.
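The three save_consolidated modes can be summarized as a small decision function. This is a sketch of the behavior described in the note, not AutoModel's implementation; the function name and its arguments are hypothetical.

```python
def should_export_consolidated(save_consolidated, is_final_checkpoint):
    """Return True if consolidated HF weights would be exported for this save."""
    # YAML may deliver a boolean (legacy true/false) or a string mode name.
    if isinstance(save_consolidated, bool):
        mode = "every" if save_consolidated else "false"
    else:
        mode = str(save_consolidated).strip().lower()
        if mode == "true":  # legacy spelling of "every"
            mode = "every"

    if mode == "every":
        return True
    if mode == "final":
        return bool(is_final_checkpoint)
    if mode == "false":
        return False
    raise ValueError(f"unsupported save_consolidated value: {save_consolidated!r}")
```

For example, with the recommended final mode an intermediate save returns False, so only the sharded files and the consolidate.sh helper would be present for that checkpoint.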
The optimizer states are always saved in DCP (.distcp extension) format.
Checkpoint Symbolic Links
NeMo AutoModel automatically creates symbolic links in the checkpoint directory to provide convenient access to important checkpoints:
- LATEST: Points to the most recently saved checkpoint. This is useful for resuming training from the last saved state.
- LOWEST_VAL: Points to the checkpoint with the lowest validation score/loss. This provides easy access to the best-performing checkpoint based on validation metrics, making it ideal for model evaluation or deployment.
These symbolic links eliminate the need to manually track checkpoint names or search through directories to find the best model. When validation is enabled in your training run, both links are automatically maintained and updated as training progresses.
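A script that consumes checkpoints can follow these links instead of parsing directory names. The sketch below uses only the Python standard library; the helper name resolve_checkpoint_link is hypothetical, and it assumes each link points at a checkpoint directory inside the checkpoint root.

```python
from pathlib import Path


def resolve_checkpoint_link(checkpoint_dir, link_name="LATEST"):
    """Return the directory a checkpoint symlink points to, or None if unavailable."""
    link = Path(checkpoint_dir) / link_name
    if not link.is_symlink():
        # For example, LOWEST_VAL may be absent when validation is disabled.
        return None
    target = link.resolve()
    return target if target.is_dir() else None
```

Calling resolve_checkpoint_link("checkpoints", "LOWEST_VAL") would return the best-validation checkpoint directory when that link exists, and None otherwise.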
Safetensors
To ensure seamless integration with the Hugging Face ecosystem, NeMo AutoModel saves checkpoints in the Safetensors format. Safetensors is a memory-safe, zero-copy alternative to Python's pickle-based format (PyTorch .bin). It is natively supported by Hugging Face Transformers and offers both safety and performance advantages over pickle.
Key Benefits
- Native Hugging Face Compatibility: Checkpoints can be loaded directly into Hugging Face-compatible tools, including vLLM, SGLang, and others.
- Memory Safety and Speed: The Safetensors format prohibits saving serialized Python code, ensuring memory safety, and supports zero-copy loading for improved performance.
- Optional Consolidation: Sharded checkpoints can be merged into a standard Hugging Face model format for easier downstream use.
Most importantly, this format offers the added advantage of optionally consolidating multiple shards into a complete Hugging Face format model.
Example
The following command runs the LLM fine-tuning recipe on two GPUs and saves the resulting checkpoint in the Safetensors format:
The command above uses the llama3_2_1b_squad.yaml config as a running example; adjust it as necessary for your case.
More config examples can be found in our examples/ directory.
If you're running on a single GPU, you can run:
After running for a few seconds, the standard output should be:
The checkpoints/ directory should have the following contents:
The epoch_0_step_20/ directory stores the full training state from step 20 of the first epoch, including both the model and optimizer states.
Because this example uses save_consolidated: final, intermediate checkpoints such as epoch_0_step_20/ do not include model/consolidated/ before the run reaches the final checkpoint. To export this intermediate checkpoint for Hugging Face-compatible tools, run the generated helper:
Run the helper from the AutoModel repo root so it can find tools/offline_hf_consolidation.py, or set CONSOLIDATION_TOOL=/path/to/tools/offline_hf_consolidation.py.
The helper defaults to one CPU worker process with five writer threads so it is safe on small machines. For large checkpoints, run it on a CPU compute node and increase parallelism:
NPROC_PER_NODE controls worker processes, and NUM_THREADS controls writer threads per process. Keep NPROC_PER_NODE * NUM_THREADS within your CPU allocation. You can also submit the helper to a CPU Slurm partition, for example:
By default, consolidated export uses the original Hugging Face safetensors headers when they are available. Ordinary floating-point tensors are restored to their original per-tensor HF dtype, such as BF16, FP16, or FP32, even if the saved sharded checkpoint uses a different floating dtype. If the run started from config-only weights or the original HF metadata is unavailable, export keeps the saved checkpoint dtype. If an original quantized or packed tensor was saved as a floating-point tensor, export leaves it as float and emits a warning.
You can request an explicit floating-point dtype cast during offline export:
Use CAST_DTYPE when the consolidated Hugging Face bundle should override the default per-tensor dtype behavior, such as CAST_DTYPE=bf16 to export ordinary floating-point tensors as BF16 for serving. Supported values include bf16, fp16, fp32, and fp64. Only ordinary floating-point tensors with a different source dtype are cast; tensors already in the cast dtype, FP8 tensors, and non-floating tensors are left unchanged.
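The casting rule above can be written as a small decision function. This sketch models only the explicit CAST_DTYPE rule, using string labels instead of real tensor dtypes; the function name and the labels for FP8 and integer types are hypothetical, and the helper's actual implementation may differ.

```python
# Values the text lists as supported cast targets ("ordinary" floating-point dtypes).
ORDINARY_FLOAT_DTYPES = {"bf16", "fp16", "fp32", "fp64"}


def dtype_after_cast(source_dtype, cast_dtype):
    """Return the dtype label a tensor would have after an explicit cast request."""
    if cast_dtype not in ORDINARY_FLOAT_DTYPES:
        raise ValueError(f"unsupported cast dtype: {cast_dtype!r}")
    # Only ordinary floating-point tensors with a different dtype are cast.
    if source_dtype in ORDINARY_FLOAT_DTYPES and source_dtype != cast_dtype:
        return cast_dtype
    # Same-dtype, FP8, and non-floating tensors are left unchanged.
    return source_dtype
```

Under this rule, an fp32 tensor exported with a bf16 cast becomes bf16, while an FP8 or integer tensor keeps its saved dtype.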
The helper writes checkpoints/epoch_0_step_20/model/consolidated/. We can load and run that consolidated checkpoint using the Hugging Face Transformers API directly:
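A minimal sketch of such a load, using the standard Transformers AutoModelForCausalLM and AutoTokenizer classes. The function name is hypothetical, and the sketch assumes the consolidated directory also contains the tokenizer files; if it does not, pass the base model's name or path as tokenizer_dir.

```python
def generate_from_consolidated(
    prompt,
    model_dir="checkpoints/epoch_0_step_20/model/consolidated",
    tokenizer_dir=None,
    max_new_tokens=32,
):
    """Load a consolidated checkpoint with Transformers and generate a completion."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: tokenizer files live next to the weights unless tokenizer_dir is given.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir or model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate_from_consolidated("The capital of France is") after the helper has finished would load the exported weights from the default path shown above.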
Although this example uses the Hugging Face Transformers API, the consolidated/ checkpoint is compatible with any Hugging Face-compatible tool, such as vLLM, SGLang, and others.
PEFT
When training with Parameter-Efficient Fine-Tuning (PEFT) techniques, only a small subset of model weights is updated; the rest of the model remains frozen. This dramatically reduces the size of the checkpoint, often to just a few megabytes.
PEFT checkpoints save adapter files directly under model/ and do not generate or need model/consolidate.sh.
Why Consolidated Adapter Checkpoints?
Because the PEFT state is so lightweight, sharded checkpointing adds unnecessary overhead. Instead, NeMo AutoModel automatically saves a compact Hugging Face-compatible adapter checkpoint when using PEFT. This makes it:
- easier to manage and share (just the adapters),
- compatible with Hugging Face Transformers out of the box,
- ideal for deployment and downstream evaluation.
Example: PEFT Fine-Tuning on Two GPUs
To fine-tune a model using PEFT and save a Hugging Face-ready checkpoint:
After training, you'll get a compact Safetensors adapter checkpoint that can be loaded directly with Hugging Face tools:
The example below showcases the direct compatibility of NeMo AutoModel with Hugging Face and PEFT:
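A minimal sketch of loading such an adapter with the Hugging Face PEFT library. The function name is hypothetical, and both arguments are placeholders: base_model_name must be the model the adapter was trained on, and adapter_dir is the model/ directory of the PEFT checkpoint, where the adapter files are saved.

```python
def load_peft_adapter(base_model_name, adapter_dir):
    """Attach a saved PEFT adapter to its frozen base model."""
    # Imported lazily so that defining this helper does not require peft/transformers.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the frozen base weights, then apply the adapter weights on top.
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
    return PeftModel.from_pretrained(base_model, adapter_dir)
```

The returned model can be used for generation or evaluation like any other Transformers model.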
PyTorch DCP
NeMo AutoModel also offers native PyTorch DCP checkpointing support (.distcp extension). Like Safetensors, it supports load-time resharding and parallel saving.
As a simple example, we can run the following command to launch the training recipe on two GPUs.
After 20 steps, the following checkpoint will be saved:
If you rerun the script, NeMo AutoModel automatically detects and restores the most recent checkpoint.
Save Checkpoints When Using Docker
When training inside a Docker container (see Installation Guide), any files written to the container's filesystem are lost when the container exits (especially with --rm). To keep your checkpoints, you must bind-mount a host directory to the checkpoint path before starting the container:
You can also set a custom checkpoint directory through the YAML config or CLI override:
When using a custom path, make sure the corresponding host directory is mounted into the container with -v.
Mount additional host directories for datasets and the Hugging Face model cache to avoid re-downloading large models across container restarts. See the Installation Guide for a complete docker run example with all recommended mounts.
Enable Asynchronous Checkpointing
NeMo AutoModel can write checkpoints asynchronously to reduce training stalls caused by I/O. When enabled, checkpoint writes are scheduled in the background using PyTorch Distributed Checkpointing's async API while training continues.
- Enable (YAML): set is_async: true under the checkpoint section of the config.
- Enable (CLI): add --checkpoint.is_async True to your run command.
- Requirements: PyTorch ≥ 2.9.0. If an older version is detected, async mode is automatically disabled.
- Behavior: At most one checkpoint uploads at a time; the next save waits for the previous upload to finish. The LATEST symlink is updated after the async save completes (may be deferred until the next save call). During PEFT, adapter model files are written synchronously on rank 0; optimizer states can still use async.
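The "at most one save in flight" behavior can be illustrated with a toy sketch built on a single background thread. This is not the PyTorch DCP async API and not AutoModel's code; the class SingleFlightSaver and its attributes are hypothetical, and the latest attribute merely stands in for the LATEST symlink.

```python
from concurrent.futures import ThreadPoolExecutor


class SingleFlightSaver:
    """Toy async saver: one background write at a time, LATEST updated on completion."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None
        self._pending_name = None
        self.latest = None  # stand-in for the LATEST symlink

    def _finalize(self):
        # Wait for the previous write (re-raising its errors), then publish it.
        if self._pending is not None:
            self._pending.result()
            self.latest = self._pending_name
            self._pending = None

    def save(self, name, state):
        self._finalize()          # the next save waits for the previous one
        snapshot = dict(state)    # copy so training can keep mutating `state`
        self._pending_name = name
        self._pending = self._pool.submit(self._write_fn, name, snapshot)

    def close(self):
        self._finalize()
        self._pool.shutdown()
```

In this sketch the latest marker for a checkpoint is only published when the following save (or close) runs, which mirrors the deferred LATEST update described above.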
Advanced Usage: Save Additional States
You can also save additional states in NeMo AutoModel. By default, NeMo AutoModel also checkpoints the dataloader, rng, and step_scheduler states, which are necessary to resume training accurately. In full, a Safetensors consolidated checkpoint will look like this:
If you want to define a new state to be checkpointed in the recipe, the easiest way is to create a new attribute in the recipe class (defined using self. inside the recipe). Just make sure that the new attribute implements both the load_state_dict and state_dict methods.
Here is an example of what it might look like:
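A minimal sketch of such a state object follows. The tracked fields (best_val_loss, num_evals) and the update method are illustrative assumptions; only the state_dict and load_state_dict methods are required by the description above.

```python
class NewState:
    """Example custom state that can be saved and restored with a checkpoint."""

    def __init__(self, best_val_loss=float("inf")):
        self.best_val_loss = best_val_loss
        self.num_evals = 0

    def update(self, val_loss):
        # Track the best validation loss seen so far and how many evals ran.
        self.best_val_loss = min(self.best_val_loss, val_loss)
        self.num_evals += 1

    def state_dict(self):
        # Return everything needed to rebuild this object after a restart.
        return {"best_val_loss": self.best_val_loss, "num_evals": self.num_evals}

    def load_state_dict(self, state_dict):
        # Restore the fields produced by state_dict().
        self.best_val_loss = state_dict["best_val_loss"]
        self.num_evals = state_dict["num_evals"]
```

A round trip through state_dict() and load_state_dict() should reproduce the same field values, which is the property that resuming from a checkpoint relies on.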
Inside your recipe class, define the new state as an instance attribute using self.new_state = NewState(...).