Checkpointing in NeMo AutoModel

Introduction

During machine-learning experiments, the model-training routine regularly saves checkpoints. A checkpoint is a complete snapshot of a run that includes model weights, optimizer states, and other metadata required to resume training exactly where it left off. Writing these snapshots at regular intervals lets you recover quickly from crashes or pauses without losing progress.

NeMo AutoModel checkpoints capture the complete state of a distributed training run across multiple GPUs or nodes. This reduces memory overhead, improves GPU utilization, and allows training to be resumed with a different parallelism strategy.

NeMo AutoModel writes checkpoints in two formats: Hugging Face Safetensors and PyTorch Distributed Checkpointing (DCP). It also supports two layouts:

Consolidated Checkpoints: The complete model state is saved as a Hugging Face-compatible bundle, typically in a single file or a compact set of files with an index. Because tensors are not split across GPUs (unsharded), tools like Hugging Face, vLLM, and SGLang can load these checkpoints directly.
Sharded Checkpoints: During distributed training with parameter sharding, each GPU holds a subset (or “shard”) of the full state, such as model weights and optimizer states. When checkpointing, each GPU writes its own shard independently without reconstructing the full model state.
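The difference between the two layouts can be pictured with a small conceptual sketch. Plain Python lists stand in for tensors and GPUs here, and the helper names `shard` and `consolidate` are hypothetical; this is not how NeMo AutoModel splits tensors internally.

```python
# Conceptual sketch only: plain lists stand in for tensors and ranks.

def shard(values: list, world_size: int) -> list:
    """Split a flat parameter list into one contiguous shard per rank."""
    per_rank = -(-len(values) // world_size)  # ceiling division
    return [values[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def consolidate(shards: list) -> list:
    """Rebuild the full (unsharded) parameter list from the per-rank shards."""
    return [value for rank_shard in shards for value in rank_shard]


weights = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = shard(weights, world_size=2)  # sharded layout: each rank writes its own piece
full = consolidate(shards)             # consolidated layout: one complete copy
```

In the sharded layout no rank ever needs to hold `full` in memory, which is why intermediate checkpoints are cheaper to write that way.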

The table below summarizes which checkpoint formats are available for each task and model domain.

Task	Model domain	DCP (sharded)	Safetensors (sharded)	Safetensors (consolidated)
SFT	LLM	✅	✅	✅
SFT	VLM	✅	✅	✅
PEFT	LLM / VLM	🚧	🚧	✅

You can switch between output formats in the recipe’s YAML configuration file:

checkpoint:
    ...
    model_save_format: safetensors # Format for saving (torch_save or safetensors)
    save_consolidated: final # Recommended: export consolidated HF weights only for the final checkpoint.
                             # Other modes: false (sharded only) or every/true (export every checkpoint).
    ...

Note: save_consolidated accepts:

final (recommended): keep intermediate checkpoints sharded and export consolidated HF weights only for the final checkpoint.

false: save sharded checkpoints only. Run the generated model/consolidate.sh helper later if you need HF weights.

every (or legacy true): export consolidated HF weights during every checkpoint save. Use this only when every checkpoint must be immediately loadable by HF tools.

AutoModel writes a model/consolidate.sh helper next to safetensors model shards. Use this helper to create a Hugging Face-compatible model/consolidated/ directory after training for save_consolidated: false checkpoints, or for earlier checkpoints when using save_consolidated: final. Creating consolidated Hugging Face weights requires model_save_format: safetensors.
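The three save_consolidated modes described above reduce to a simple decision at each checkpoint save. The sketch below is illustrative only: `should_export_consolidated` is a hypothetical helper, not a function from the NeMo AutoModel codebase.

```python
def should_export_consolidated(save_consolidated, is_final_checkpoint: bool) -> bool:
    """Return True if this save should also write consolidated HF weights."""
    # YAML parses `true`/`false` as booleans, so normalize everything to a string.
    mode = str(save_consolidated).lower()
    if mode in ("every", "true"):  # legacy `true` behaves like `every`
        return True
    if mode == "final":
        return is_final_checkpoint
    if mode == "false":
        return False
    raise ValueError(f"unsupported save_consolidated value: {save_consolidated!r}")
```

With `final`, an intermediate save returns False and only the last save returns True, which matches the recommended setting.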

The optimizer states are always saved in DCP (.distcp extension) format.

Checkpoint Symbolic Links

NeMo AutoModel automatically creates symbolic links in the checkpoint directory to provide convenient access to important checkpoints:

LATEST: Points to the most recently saved checkpoint. This is useful for resuming training from the last saved state.
LOWEST_VAL: Points to the checkpoint with the lowest validation score/loss. This provides easy access to the best-performing checkpoint based on validation metrics, making it ideal for model evaluation or deployment.

These symbolic links eliminate the need to manually track checkpoint names or search through directories to find the best model. When validation is enabled in your training run, both links are automatically maintained and updated as training progresses.
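As a minimal sketch of how a script might use these links, the helper below follows LATEST or LOWEST_VAL to the checkpoint directory it points at. The function name `resolve_checkpoint` is an assumption made for illustration; only the link names come from this guide.

```python
from pathlib import Path


def resolve_checkpoint(checkpoint_dir: str, link_name: str = "LATEST") -> Path:
    """Follow a checkpoint symlink (LATEST or LOWEST_VAL) to its real directory."""
    link = Path(checkpoint_dir) / link_name
    if not link.is_symlink():
        raise FileNotFoundError(f"{link} is not a symbolic link")
    return link.resolve()
```

For example, `resolve_checkpoint("checkpoints", "LOWEST_VAL")` would return the directory of the best checkpoint by validation loss, provided validation was enabled for the run.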

Safetensors

To ensure seamless integration with the Hugging Face ecosystem, NeMo AutoModel saves checkpoints in the Safetensors format. Safetensors is a safe, zero-copy alternative to Python’s pickle-based serialization (PyTorch .bin) and is natively supported by Hugging Face Transformers.

Key Benefits

Native Hugging Face Compatibility: Checkpoints can be loaded directly into Hugging Face-compatible tools, including vLLM, SGLang, and others.
Safety and Speed: A Safetensors file stores only tensor data and metadata, never serialized Python code, so loading a checkpoint cannot execute arbitrary code. The format also supports zero-copy loading for improved performance.
Optional Consolidation: Sharded checkpoints can be merged into a standard Hugging Face model format for easier downstream use.

Most importantly, this format offers the added advantage of optionally consolidating multiple shards into a complete Hugging Face format model.
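The safety property follows from the file layout: a Safetensors file is an 8-byte little-endian header length, a JSON header describing each tensor, and then raw tensor bytes. The sketch below reads only that header with the standard library; `read_safetensors_header` is a hypothetical helper name, and in practice the `safetensors` package would normally be used instead.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form string metadata
    return header
```

Each remaining entry maps a tensor name to its `dtype`, `shape`, and `data_offsets`, so a shard can be inspected cheaply before it is loaded.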

Example

The following command runs the LLM fine-tuning recipe on two GPUs and saves the resulting checkpoint in the Safetensors format:

$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>     --step_scheduler.ckpt_every_steps 20 \
>     --checkpoint.model_save_format safetensors \
>     --checkpoint.save_consolidated final

The command above uses the llama3_2_1b_squad.yaml config as a running example; adjust it as necessary for your case. More config examples can be found in our examples/ directory.

If you’re running on a single GPU, you can run:

$ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>     --step_scheduler.ckpt_every_steps 20 \
>     --checkpoint.model_save_format safetensors \
>     --checkpoint.save_consolidated final

After running for a few seconds, the standard output should include:

...
> Saving checkpoint to checkpoints/epoch_0_step_20
...

The checkpoints/ directory should have the following contents:

checkpoints/
├── LATEST -> epoch_0_step_20
├── LOWEST_VAL -> epoch_0_step_20
└── epoch_0_step_20
   ├── model
   │   ├── consolidate.sh
   │   ├── shard-00001-model-00001-of-00001.safetensors
   │   └── shard-00002-model-00001-of-00001.safetensors
   └── optim
       ├── __0_0.distcp
       └── __1_0.distcp
...

The epoch_0_step_20/ directory stores the full training state from step 20 of the first epoch, including both the model and optimizer states.

Because this example uses save_consolidated: final, intermediate checkpoints such as epoch_0_step_20/ do not include model/consolidated/ before the run reaches the final checkpoint. To export this intermediate checkpoint for Hugging Face-compatible tools, run the generated helper:

$ bash checkpoints/epoch_0_step_20/model/consolidate.sh

Run the helper from the AutoModel repo root so it can find tools/offline_hf_consolidation.py, or set CONSOLIDATION_TOOL=/path/to/tools/offline_hf_consolidation.py.

The helper defaults to one CPU worker process with five writer threads so it is safe on small machines. For large checkpoints, run it on a CPU compute node and increase parallelism:

$ NPROC_PER_NODE=16 NUM_THREADS=5 bash checkpoints/epoch_0_step_20/model/consolidate.sh

NPROC_PER_NODE controls worker processes, and NUM_THREADS controls writer threads per process. Keep NPROC_PER_NODE * NUM_THREADS within your CPU allocation. You can also submit the helper to a CPU Slurm partition, for example:

$ sbatch --cpus-per-task=80 --wrap='NPROC_PER_NODE=16 NUM_THREADS=5 bash /path/to/checkpoints/epoch_0_step_20/model/consolidate.sh'
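The constraint NPROC_PER_NODE * NUM_THREADS ≤ allocated CPUs can be turned into a small calculation. This is a sketch under the assumption that you simply want to fill the allocation; `consolidation_env` is a hypothetical helper, and only the two environment variable names come from this guide.

```python
def consolidation_env(cpus: int, num_threads: int = 5) -> dict:
    """Pick NPROC_PER_NODE and NUM_THREADS so their product fits within `cpus`."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    threads = max(1, min(num_threads, cpus))  # never ask for more threads than CPUs
    nproc = max(1, cpus // threads)           # fill the rest of the budget with processes
    return {"NPROC_PER_NODE": str(nproc), "NUM_THREADS": str(threads)}
```

For an 80-CPU allocation with the default five writer threads, this yields 16 worker processes, the same values used in the Slurm example above.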

By default, consolidated export uses the original Hugging Face safetensors headers when they are available. Ordinary floating-point tensors are restored to their original per-tensor HF dtype, such as BF16, FP16, or FP32, even if the saved sharded checkpoint uses a different floating dtype. If the run started from config-only weights or the original HF metadata is unavailable, export keeps the saved checkpoint dtype. If an original quantized or packed tensor was saved as a floating-point tensor, export leaves it as float and emits a warning.

You can request an explicit floating-point dtype cast during offline export:

$ CAST_DTYPE=bf16 bash checkpoints/epoch_0_step_20/model/consolidate.sh

Use CAST_DTYPE when the consolidated Hugging Face bundle should override the default per-tensor dtype behavior, such as CAST_DTYPE=bf16 to export ordinary floating-point tensors as BF16 for serving. Supported values include bf16, fp16, fp32, and fp64. Only ordinary floating-point tensors with a different source dtype are cast; tensors already in the cast dtype, FP8 tensors, and non-floating tensors are left unchanged.
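The casting rule can be summarized as a per-tensor decision. The sketch below models it with plain strings; the function `export_dtype`, the `ORDINARY_FLOATS` set, and dtype labels such as "fp8_e4m3" or "int64" are assumptions for illustration, not the export tool's actual code.

```python
from typing import Optional

# Assumed set of "ordinary" floating-point dtypes, taken from the listed CAST_DTYPE values.
ORDINARY_FLOATS = {"bf16", "fp16", "fp32", "fp64"}


def export_dtype(source_dtype: str, cast_dtype: Optional[str]) -> str:
    """Return the dtype a tensor would be written with during consolidated export."""
    if cast_dtype is None:
        return source_dtype  # no CAST_DTYPE set: default per-tensor behavior applies
    if cast_dtype not in ORDINARY_FLOATS:
        raise ValueError(f"unsupported CAST_DTYPE: {cast_dtype!r}")
    if source_dtype in ORDINARY_FLOATS and source_dtype != cast_dtype:
        return cast_dtype    # ordinary float with a different dtype: cast it
    return source_dtype      # already matching, FP8, or non-floating: unchanged
```

With CAST_DTYPE=bf16, an "fp32" weight becomes "bf16", while an FP8 weight or an integer buffer keeps its original dtype.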

The helper writes checkpoints/epoch_0_step_20/model/consolidated/. We can load and run that consolidated checkpoint using the Hugging Face Transformers API directly:

import torch
from transformers import pipeline

# Path to the consolidated checkpoint written by consolidate.sh
model_id = "checkpoints/epoch_0_step_20/model/consolidated/"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(pipe("The key to life is"))

# Example output:
# [{'generated_text': 'The key to life is to be happy. The key to happiness is to be kind. The key to kindness is to be'}]

Although this example uses the Hugging Face Transformers API, the consolidated/ checkpoint is compatible with any Hugging Face-compatible tool, such as vLLM, SGLang, and others.
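Before pointing a serving tool at consolidated/, it can help to confirm that the export finished. The sketch below checks that every weight file named in model.safetensors.index.json exists on disk. It relies on the standard Hugging Face index layout (a `weight_map` from tensor names to file names); `missing_weight_files` is a hypothetical helper name.

```python
import json
from pathlib import Path


def missing_weight_files(consolidated_dir: str) -> list:
    """List safetensors files referenced by the index that are absent on disk."""
    root = Path(consolidated_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    referenced = set(index["weight_map"].values())  # tensor name -> file name
    return sorted(name for name in referenced if not (root / name).is_file())
```

An empty list means all referenced weight files are present; it does not validate the tensor contents themselves.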

PEFT

When training with Parameter-Efficient Fine-Tuning (PEFT) techniques, only a small subset of model weights is updated; the rest of the model remains frozen. This dramatically reduces the size of the checkpoint, often to just a few megabytes.

PEFT checkpoints save adapter files directly under model/ and do not generate or need model/consolidate.sh.

Why Consolidated Adapter Checkpoints?

Because the PEFT state is so lightweight, sharded checkpointing adds unnecessary overhead. Instead, NeMo AutoModel automatically saves a compact Hugging Face-compatible adapter checkpoint when using PEFT. This makes it:

easier to manage and share (just the adapters),
compatible with Hugging Face Transformers out of the box,
ideal for deployment and downstream evaluation.

Example: PEFT Fine-Tuning on Two GPUs

To fine-tune a model using PEFT and save a Hugging Face–ready checkpoint:

$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag_peft.yaml --step_scheduler.ckpt_every_steps 20 --checkpoint.model_save_format safetensors

After training, you’ll get a compact Safetensors adapter checkpoint that can be loaded directly with Hugging Face tools:

checkpoints/
├── LATEST -> epoch_0_step_20
├── LOWEST_VAL -> epoch_0_step_20
├── epoch_0_step_20
│   ├── config.yaml
│   ├── dataloader
│   │   ├── dataloader_dp_rank_0.pt
│   │   └── dataloader_dp_rank_1.pt
│   ├── losses.json
│   ├── model
│   │   ├── adapter_config.json
│   │   ├── adapter_model.safetensors
│   │   ├── automodel_peft_config.json
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   └── tokenizer_config.json
│   ├── optim
│   │   ├── __0_0.distcp
│   │   └── __1_0.distcp
│   ├── rng
│   │   ├── rng_dp_rank_0.pt
│   │   └── rng_dp_rank_1.pt
│   └── step_scheduler.pt
├── training.jsonl
└── validation.jsonl

The example below showcases the direct compatibility of NeMo AutoModel with Hugging Face and PEFT:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# The adapter files and the tokenizer live directly under model/
checkpoint_path = "checkpoints/epoch_0_step_20/model/"
model = AutoPeftModelForCausalLM.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

model = model.to("cuda")
model.eval()
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")

outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])

# Example output:
# Preheat the oven to 350 degrees and place the cookie dough in a large bowl. Roll the dough into 1-inch balls and place them on a cookie sheet. Bake the cookies for 10 minutes. While the cookies are baking, melt the chocolate chips in the microwave for 30 seconds.
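To see what an adapter checkpoint contains without loading the base model, you can read adapter_config.json directly. The sketch below assumes a LoRA-style adapter; the field names are those commonly written by the Hugging Face PEFT library and may differ for other adapter types, so every lookup uses `.get`. The helper name `summarize_adapter` is hypothetical.

```python
import json
from pathlib import Path


def summarize_adapter(model_dir: str) -> dict:
    """Return a few commonly present fields from adapter_config.json (None if absent)."""
    config = json.loads((Path(model_dir) / "adapter_config.json").read_text())
    keys = ("peft_type", "base_model_name_or_path", "r", "lora_alpha", "target_modules")
    return {key: config.get(key) for key in keys}
```

Calling `summarize_adapter("checkpoints/epoch_0_step_20/model/")` shows, among other things, which base model the adapter expects to be applied to.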

PyTorch DCP

NeMo AutoModel also offers native PyTorch DCP checkpointing support (.distcp extension). Like the sharded Safetensors format, DCP supports load-time resharding and parallel saving.

As a simple example, we can run the following command to launch the training recipe on two GPUs.

$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>     --step_scheduler.ckpt_every_steps 20 \
>     --checkpoint.model_save_format torch_save

...
> Saving checkpoint to checkpoints/epoch_0_step_20
...

After 20 steps, the following checkpoint will be saved:

checkpoints/
├── LATEST -> epoch_0_step_20
├── LOWEST_VAL -> epoch_0_step_20
└── epoch_0_step_20
   ├── config.yaml
   ├── dataloader
   │   ├── dataloader_dp_rank_0.pt
   │   └── dataloader_dp_rank_1.pt
   ├── losses.json
   ├── model
   │   ├── __0_0.distcp
   │   └── __1_0.distcp
   └── optim
       ├── __0_0.distcp
       └── __1_0.distcp
...

If you rerun the script, NeMo AutoModel automatically detects and restores the most recent checkpoint.

$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>     --step_scheduler.ckpt_every_steps 20 \
>     --checkpoint.model_save_format torch_save

...
> Loading checkpoint from checkpoints/epoch_0_step_20
...
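This guide does not show how NeMo AutoModel locates the most recent checkpoint. As one possible illustration, the sketch below picks it from the epoch_{E}_step_{S} directory names, comparing the numbers rather than the strings so that step 100 sorts after step 20. The helper `find_latest_checkpoint` is hypothetical; following the LATEST symlink is an equally valid approach.

```python
import re
from pathlib import Path
from typing import Optional

# Directory naming pattern taken from the listings in this guide.
_CKPT_NAME = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest (epoch, step), or None."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CKPT_NAME.match(child.name)
        if match and child.is_dir():
            epoch, step = int(match.group(1)), int(match.group(2))
            candidates.append(((epoch, step), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```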

Save Checkpoints When Using Docker

When training inside a Docker container (see Installation Guide), any files written to the container’s filesystem are lost when the container exits (especially with --rm). To keep your checkpoints, you must bind-mount a host directory to the checkpoint path before starting the container:

$ docker run --gpus all -it --rm \
>   --shm-size=8g \
>   -v "$(pwd)"/checkpoints:/opt/Automodel/checkpoints \
>   nvcr.io/nvidia/nemo-automodel:26.06.00

You can also set a custom checkpoint directory through the YAML config or CLI override:

checkpoint:
  checkpoint_dir: /mnt/shared/my_checkpoints

$ # Or via CLI override:
$ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>     --checkpoint.checkpoint_dir /mnt/shared/my_checkpoints

When using a custom path, make sure the corresponding host directory is mounted into the container with -v.

Mount additional host directories for datasets and the Hugging Face model cache to avoid re-downloading large models across container restarts. See the Installation Guide for a complete docker run example with all recommended mounts.

Enable Asynchronous Checkpointing

NeMo AutoModel can write checkpoints asynchronously to reduce training stalls caused by I/O. When enabled, checkpoint writes are scheduled in the background using PyTorch Distributed Checkpointing’s async API while training continues.

Enable (YAML):
```
checkpoint:
  is_async: true
```
Enable (CLI): add --checkpoint.is_async True to your run command.
Requirements: PyTorch ≥ 2.9.0. If an older version is detected, async mode is automatically disabled.
Behavior: At most one checkpoint uploads at a time; the next save waits for the previous upload to finish. The LATEST symlink is updated after the async save completes (may be deferred until the next save call). During PEFT, adapter model files are written synchronously on rank 0; optimizer states can still use async.

Advanced Usage: Save Additional States

NeMo AutoModel can also checkpoint additional states. By default, it saves the dataloader, rng, and step_scheduler states, which are needed to resume training accurately. A complete checkpoint with consolidated Safetensors weights looks like this:

checkpoints/
├── LATEST -> epoch_0_step_20
├── LOWEST_VAL -> epoch_0_step_20
├── epoch_0_step_20
│   ├── config.yaml
│   ├── dataloader
│   │   ├── dataloader_dp_rank_0.pt
│   │   └── dataloader_dp_rank_1.pt
│   ├── losses.json
│   ├── model
│   │   ├── consolidated
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   ├── model-00001-of-00001.safetensors
│   │   │   ├── model.safetensors.index.json
│   │   │   ├── special_tokens_map.json
│   │   │   ├── tokenizer.json
│   │   │   └── tokenizer_config.json
│   │   ├── shard-00001-model-00001-of-00001.safetensors
│   │   └── shard-00002-model-00001-of-00001.safetensors
│   ├── optim
│   │   ├── __0_0.distcp
│   │   └── __1_0.distcp
│   ├── rng
│   │   ├── rng_dp_rank_0.pt
│   │   └── rng_dp_rank_1.pt
│   └── step_scheduler.pt
├── training.jsonl
└── validation.jsonl

To checkpoint a new state from the recipe, the easiest way is to add it as an attribute of the recipe class (assigned through self. inside the recipe). The attribute's object must implement both the state_dict and load_state_dict methods.

Here is an example of what it might look like:

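from typing import Any


class NewState:
    """Template for a custom state object; replace the placeholders with your own fields."""

    def __init__(self, state_value: Any = None, another_value: Any = None) -> None:
        self.state_value = state_value
        self.another_value = another_value

    def state_dict(self) -> dict[str, Any]:
        # Collect everything needed to restore this object when training resumes.
        return {
            "<some state you're tracking>": self.state_value,
            "<another state you're tracking>": self.another_value,
        }

    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
        # Restore the fields from a dictionary produced by state_dict().
        self.state_value = state_dict["<some state you're tracking>"]
        self.another_value = state_dict["<another state you're tracking>"]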

Inside your recipe class, define the new state as an instance attribute using self.new_state = NewState(...).
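As a concrete, hypothetical instance of this pattern, the sketch below tracks the best validation loss seen so far and shows the save/restore round trip. The class `BestLossTracker` and its fields are invented for illustration; only the state_dict / load_state_dict contract comes from this guide.

```python
from typing import Any


class BestLossTracker:
    """Remember the lowest validation loss and the step at which it occurred."""

    def __init__(self) -> None:
        self.best_loss = float("inf")
        self.best_step = -1

    def update(self, loss: float, step: int) -> None:
        if loss < self.best_loss:
            self.best_loss, self.best_step = loss, step

    def state_dict(self) -> dict[str, Any]:
        return {"best_loss": self.best_loss, "best_step": self.best_step}

    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
        self.best_loss = state_dict["best_loss"]
        self.best_step = state_dict["best_step"]


# Round trip: what a save followed by a resume would do with this object.
tracker = BestLossTracker()
tracker.update(loss=2.31, step=10)
tracker.update(loss=1.87, step=20)
saved = tracker.state_dict()

restored = BestLossTracker()
restored.load_state_dict(saved)
```

Keeping the dictionary limited to plain Python values makes the state easy to serialize and inspect.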
