# nemo_automodel.components.checkpoint.checkpointing

## Module Contents

### Classes

| Name                                                                                         | Description                                                                |
| -------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
| [`Checkpointer`](#nemo_automodel-components-checkpoint-checkpointing-Checkpointer)           | High-level checkpoint manager built on torch.distributed.checkpoint (DCP). |
| [`_AsyncSaveContext`](#nemo_automodel-components-checkpoint-checkpointing-_AsyncSaveContext) | Internal container for async checkpointing state.                          |

### Functions

| Name                                                                                                                                             | Description                                                                                                        |
| ------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------ |
| [`_adapter_path`](#nemo_automodel-components-checkpoint-checkpointing-_adapter_path)                                                             | Return the PEFT adapter safetensors path inside a checkpoint dir (local or `msc://`).                              |
| [`_apply`](#nemo_automodel-components-checkpoint-checkpointing-_apply)                                                                           | Apply a transformation function to parameters (and gradients) only.                                                |
| [`_apply_key_mapping`](#nemo_automodel-components-checkpoint-checkpointing-_apply_key_mapping)                                                   | Rename state-dict keys using regex-based `key_mapping`.                                                            |
| [`_convert_checkpoint_with_transformers`](#nemo_automodel-components-checkpoint-checkpointing-_convert_checkpoint_with_transformers)             | Convert a checkpoint using transformers' conversion mapping for models that need tensor merging.                   |
| [`_divide_keys_by_size`](#nemo_automodel-components-checkpoint-checkpointing-_divide_keys_by_size)                                               | Assign keys to deterministic size-based shards.                                                                    |
| [`_ensure_dirs`](#nemo_automodel-components-checkpoint-checkpointing-_ensure_dirs)                                                               | Create directories on all ranks and synchronize across ranks.                                                      |
| [`_ensure_msc_available`](#nemo_automodel-components-checkpoint-checkpointing-_ensure_msc_available)                                             | Raise an error if MSC is not installed but a cloud path is used.                                                   |
| [`_equally_divide_layers`](#nemo_automodel-components-checkpoint-checkpointing-_equally_divide_layers)                                           | Equally divide the state dict keys into num\_shards shards.                                                        |
| [`_get_checkpoint_metadata_keys`](#nemo_automodel-components-checkpoint-checkpointing-_get_checkpoint_metadata_keys)                             | Return checkpoint FQNs present in metadata.                                                                        |
| [`_get_hf_safetensors_reference_path`](#nemo_automodel-components-checkpoint-checkpointing-_get_hf_safetensors_reference_path)                   | Return the local HF safetensors reference directory for a model.                                                   |
| [`_get_original_hf_index_total_size`](#nemo_automodel-components-checkpoint-checkpointing-_get_original_hf_index_total_size)                     | Return the original HF safetensors index total size, if available.                                                 |
| [`_init_peft_adapters`](#nemo_automodel-components-checkpoint-checkpointing-_init_peft_adapters)                                                 | Initialize the PEFT adapters with the scaled weights.                                                              |
| [`_is_bin_checkpoint`](#nemo_automodel-components-checkpoint-checkpointing-_is_bin_checkpoint)                                                   | Return True if path looks like a PyTorch .bin checkpoint.                                                          |
| [`_is_custom_model`](#nemo_automodel-components-checkpoint-checkpointing-_is_custom_model)                                                       | True if the model has a custom implementation in nemo\_automodel/components/models/.                               |
| [`_is_remote_code_model`](#nemo_automodel-components-checkpoint-checkpointing-_is_remote_code_model)                                             | True if the model was loaded with trust\_remote\_code (HF dynamic modules).                                        |
| [`_is_safetensors_checkpoint`](#nemo_automodel-components-checkpoint-checkpointing-_is_safetensors_checkpoint)                                   | Return True if path looks like a safetensors checkpoint (so we can preserve dtype); else DCP or other.             |
| [`_load_full_state_dict_into_model`](#nemo_automodel-components-checkpoint-checkpointing-_load_full_state_dict_into_model)                       | Load a full (non-sharded) state dict into a potentially FSDP-wrapped model.                                        |
| [`_load_hf_bin_checkpoint`](#nemo_automodel-components-checkpoint-checkpointing-_load_hf_bin_checkpoint)                                         | Load a HuggingFace .bin checkpoint into a state dict.                                                              |
| [`_load_hf_checkpoint_preserving_dtype`](#nemo_automodel-components-checkpoint-checkpointing-_load_hf_checkpoint_preserving_dtype)               | Load a HuggingFace checkpoint into a new state dict so tensor dtypes are preserved.                                |
| [`_load_hf_safetensors_checkpoint`](#nemo_automodel-components-checkpoint-checkpointing-_load_hf_safetensors_checkpoint)                         | Load a safetensors checkpoint into a state dict.                                                                   |
| [`_load_safetensors`](#nemo_automodel-components-checkpoint-checkpointing-_load_safetensors)                                                     | Read a safetensors file from a local path or an `msc://` cloud path.                                               |
| [`_materialize_to_hf_views_for_save`](#nemo_automodel-components-checkpoint-checkpointing-_materialize_to_hf_views_for_save)                     | Replace non-contiguous tensor values in `state_dict` with contiguous copies in place.                              |
| [`_maybe_adapt_state_dict_from_hf`](#nemo_automodel-components-checkpoint-checkpointing-_maybe_adapt_state_dict_from_hf)                         | Custom models use state dict adapters to convert the state dict from the Hugging Face format to the native format. |
| [`_maybe_adapt_state_dict_to_hf`](#nemo_automodel-components-checkpoint-checkpointing-_maybe_adapt_state_dict_to_hf)                             | Custom models use state dict adapters to convert the state dict to the Hugging Face format.                        |
| [`_maybe_msc_reader`](#nemo_automodel-components-checkpoint-checkpointing-_maybe_msc_reader)                                                     | Return an MSC filesystem reader for `msc://` paths, else the given reader.                                         |
| [`_maybe_msc_writer`](#nemo_automodel-components-checkpoint-checkpointing-_maybe_msc_writer)                                                     | Return an MSC filesystem writer for `msc://` paths, else the given writer.                                         |
| [`_model_has_dtensors`](#nemo_automodel-components-checkpoint-checkpointing-_model_has_dtensors)                                                 | True if any parameter is a DTensor (model is already sharded).                                                     |
| [`_normalize_dtype_mapping_to_state_dict_keys`](#nemo_automodel-components-checkpoint-checkpointing-_normalize_dtype_mapping_to_state_dict_keys) | Align original HF dtype metadata with the keys that will be exported.                                              |
| [`_reinit_non_persistent_buffers`](#nemo_automodel-components-checkpoint-checkpointing-_reinit_non_persistent_buffers)                           | Recompute non-persistent buffers that are not saved in checkpoints.                                                |
| [`_save_safetensors`](#nemo_automodel-components-checkpoint-checkpointing-_save_safetensors)                                                     | Write a safetensors file to a local path or an `msc://` cloud path.                                                |
| [`_should_write_consolidated_safetensors`](#nemo_automodel-components-checkpoint-checkpointing-_should_write_consolidated_safetensors)           | Whether to output consolidated HF weights along with sharded weights.                                              |
| [`_should_write_hf_metadata`](#nemo_automodel-components-checkpoint-checkpointing-_should_write_hf_metadata)                                     | Whether to write HF metadata/artifacts for a checkpoint.                                                           |
| [`_summarize_state_dict_key_diff`](#nemo_automodel-components-checkpoint-checkpointing-_summarize_state_dict_key_diff)                           | Summarize state-dict key mismatches for checkpoint load diagnostics.                                               |
| [`_warn_if_inline_consolidation_enabled`](#nemo_automodel-components-checkpoint-checkpointing-_warn_if_inline_consolidation_enabled)             | Educate users about the cost of inline HF consolidation.                                                           |
| [`_warn_if_large_inline_consolidation`](#nemo_automodel-components-checkpoint-checkpointing-_warn_if_large_inline_consolidation)                 | Warn when inline consolidated export is large enough to waste GPU allocation time.                                 |
| [`is_cloud_path`](#nemo_automodel-components-checkpoint-checkpointing-is_cloud_path)                                                             | Check if path is a cloud storage path (MSC).                                                                       |
| [`save_config`](#nemo_automodel-components-checkpoint-checkpointing-save_config)                                                                 | Save a config to a weights path.                                                                                   |
| [`to_empty_parameters_only`](#nemo_automodel-components-checkpoint-checkpointing-to_empty_parameters_only)                                       | Move parameters to the specified device without copying storage, skipping buffers.                                 |

### Data

[`MSC_AVAILABLE`](#nemo_automodel-components-checkpoint-checkpointing-MSC_AVAILABLE)

[`_CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES`](#nemo_automodel-components-checkpoint-checkpointing-_CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES)

[`_DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES`](#nemo_automodel-components-checkpoint-checkpointing-_DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES)

[`_MODELS_REQUIRING_BUFFER_REINIT`](#nemo_automodel-components-checkpoint-checkpointing-_MODELS_REQUIRING_BUFFER_REINIT)

[`logger`](#nemo_automodel-components-checkpoint-checkpointing-logger)

### API

```python
class nemo_automodel.components.checkpoint.checkpointing.Checkpointer(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
    dp_rank: int,
    tp_rank: int,
    pp_rank: int,
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
)
```

High-level checkpoint manager built on torch.distributed.checkpoint (DCP).

Supports:

* HF sharded safetensors via custom storage reader/writer
* Optional consolidated export (config, generation config, tokenizer)
* PEFT adapter save/load handling
* Async save for torch >= 2.9.0

Also provides DP-aware helpers for saving/loading auxiliary state and
utilities to initialize from a base HF checkpoint.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._do_load(
    state_dict: dict[str, torch.Tensor],
    path: str,
    storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader] = None,
    is_init_step: bool = False
) -> dict[str, torch.Tensor]
```

Load a state dictionary from `path` using DCP or PEFT special-case logic.

**Parameters:**

* `state_dict`: Mutable state dict to populate with tensors.
* `path`: Checkpoint directory path.
* `storage_reader`: Optional HF storage reader for safetensors.
* `is_init_step`: True if loading from a base checkpoint during initialization.

**Returns:** `dict[str, torch.Tensor]`

The populated state dictionary (may be replaced for PEFT).

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._do_save(
    state_dict: dict[str, torch.Tensor],
    path: str,
    storage_writer: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter] = None
) -> typing.Optional[torch.distributed.checkpoint.state_dict_saver.AsyncSaveResponse]
```

Save a state dictionary to `path` using DCP or PEFT special-case logic.

* For PEFT model saves: only rank 0 writes `adapter_model.safetensors`.
* If async mode is enabled, schedule an asynchronous save.

**Parameters:**

* `state_dict`: State dict to be serialized.
* `path`: Checkpoint directory path.
* `storage_writer`: Optional HF storage writer for safetensors sharding.

**Returns:** `Optional[AsyncSaveResponse]`

An `AsyncSaveResponse` (a future-like handle) if async mode is enabled; otherwise None.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_original_model_path(
    model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState
) -> str | None
```

Get the path to the original model from the Hugging Face checkpoint.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_storage_reader(
    model_path: str,
    key_mapping: typing.Optional[dict[str, str]],
    is_init_step: bool = False,
    is_safetensors: bool | None = None
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]
```

Construct a Hugging Face storage reader when loading safetensors or during init.

Prefers the upstream `torch.distributed.checkpoint.hf_storage.HuggingFaceStorageReader`
when no `key_mapping` is needed, since it uses safetensors' native `get_slice()` for
efficient partial reads (only the bytes for the local DTensor shard are read from disk).
Falls back to the backported reader when `key_mapping` is required or when the upstream
reader is not available.

**Parameters:**

* `model_path`: Path to the model checkpoint directory or HF snapshot.
* `key_mapping`: Optional key remapping for conversion.
* `is_init_step`: If True, always produce a reader for base HF load.
* `is_safetensors`: Whether `model_path` holds a safetensors checkpoint; computed from the directory contents when not supplied.

**Returns:** `Optional[_HuggingFaceStorageReader]`

Configured storage reader, or None for the default DCP FileSystemReader.
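
The selection rule described above can be sketched as a small decision function. This is only an illustration of the documented preference order, not the library's code: `choose_reader_kind`, its `upstream_available` flag, and the string labels it returns are all hypothetical.

```python
def choose_reader_kind(is_safetensors, is_init_step, key_mapping, upstream_available):
    """Hypothetical sketch: decide which kind of storage reader to build.

    Returns "upstream", "backport", or None (meaning: use the default
    DCP FileSystemReader).
    """
    # A Hugging Face reader is only relevant for safetensors or base-model init.
    if not (is_safetensors or is_init_step):
        return None
    # The upstream reader is preferred when no key renaming is required.
    if not key_mapping and upstream_available:
        return "upstream"
    # Otherwise fall back to the backported reader.
    return "backport"
```

For instance, `choose_reader_kind(True, False, {"a": "b"}, True)` falls back to the backported reader because a `key_mapping` is present.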

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_storage_writer(
    consolidated_output_path: typing.Optional[str],
    fqn_to_index_mapping: typing.Optional[dict[str, int]],
    fqn_to_dtype_mapping: typing.Optional[dict[str, str]],
    model_path: str,
    consolidate_on_all_ranks: bool = False
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]
```

Construct a Hugging Face storage writer for sharded safetensors.

**Parameters:**

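* `consolidated_output_path`: Optional path for consolidated artifacts.
* `fqn_to_index_mapping`: Optional mapping from FQN to shard index.
* `fqn_to_dtype_mapping`: Optional mapping from FQN to original HF safetensors dtype string.
* `model_path`: Path where the model checkpoint is saved.
* `consolidate_on_all_ranks`: If True, consolidate on all ranks on the main process.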

**Returns:** `Optional[_HuggingFaceStorageWriter]`

Configured `_HuggingFaceStorageWriter` or None for non-safetensors.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_build_consolidated_index(
    model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState,
    state_dict: dict[str, torch.Tensor]
) -> typing.Optional[dict[str, int]]
```

Build FQN to shard index mapping for consolidated HF export.

Uses the base checkpoint index (if present), removes non-persistent keys,
and assigns new keys to the last shard by default.

**Parameters:**

* `model_state`: Wrapper exposing the primary model part.
* `state_dict`: The state dict that will be saved.

**Returns:** `Optional[dict[str, int]]`

Mapping from FQN to shard index, or None when not consolidating.
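
The index-building rule above (reuse the base index, drop non-persistent keys, send new keys to the last shard) might look roughly like the following in plain Python. This is a sketch under assumptions: `build_consolidated_index` and its parameters are hypothetical, and it works on key names only rather than on tensors or a `ModelState`.

```python
def build_consolidated_index(base_index, state_dict_keys, non_persistent_keys=()):
    """Hypothetical sketch: map each saved FQN to a shard index.

    base_index: FQN -> shard index taken from the base checkpoint index.
    state_dict_keys: keys of the state dict that will be saved.
    non_persistent_keys: keys that must not appear in the export.
    """
    skip = set(non_persistent_keys)
    last_shard = max(base_index.values())
    mapping = {}
    for key in state_dict_keys:
        if key in skip:
            continue  # non-persistent entries are not exported
        # Known keys keep their original shard; new keys go to the last shard.
        mapping[key] = base_index.get(key, last_shard)
    return mapping
```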

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_build_original_dtype_mapping(
    model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState,
    state_dict: dict[str, torch.Tensor]
) -> typing.Optional[dict[str, str]]
```

Build FQN to original HF safetensors dtype mapping for consolidated export.

Returns None when the run started from config-only weights or the original HF
safetensors headers are not available. In that case consolidation keeps the
saved checkpoint dtype unless the user explicitly passes CAST\_DTYPE to the
offline helper.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_log_final_offline_consolidation_hint(
    model_dir: str,
    is_final_checkpoint: bool = False
) -> None
```

Log the final-checkpoint helper hint when consolidated export was disabled.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_write_offline_consolidation_script(
    model_dir: str
) -> None
```

Write a conservative helper script for offline HF safetensors consolidation.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.async_wait() -> None
```

Wait for the async save to finish.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.close() -> None
```

Close the checkpointer.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.initialize_model_weights(
    model: torch.nn.Module,
    device: torch.device,
    peft_init_method: str | None = None
) -> None
```

staticmethod

Materialize meta-device parameters and initialize model weights.

Moves empty parameter shells to the target device, resets HF initialization
flags, calls the model's weight initialization method, and initializes any
PEFT adapters.

**Parameters:**

* `model`: Model whose weights should be initialized.
* `device`: Target device for materialized parameters.
* `peft_init_method`: Initialization method for PEFT adapters (e.g. "xavier").

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_base_model(
    model: torch.nn.Module,
    device: torch.device,
    root_dir: str,
    model_name: str | None,
    load_base_model: bool = True
) -> None
```

Load a model from the base Hugging Face checkpoint in parallel.

**Parameters:**

* `model`: Model to load state into.
* `device`: Device to load model onto.
* `root_dir`: Root directory of the model cache or snapshots.
* `model_name`: Name of the model or an absolute path to a snapshot.
* `load_base_model`: If True, restore from HF base checkpoint.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_distributed_state(
    state: typing.Any,
    state_name: str,
    path: str
) -> None
```

Load a custom stateful object previously saved with DCP.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_model(
    model: torch.nn.Module,
    model_path: str,
    is_init_step: bool = False,
    use_checkpoint_id: bool = True,
    key_mapping: typing.Optional[dict[str, str]] = None,
    allow_checkpoint_key_subset: bool = False
) -> None
```

Load model weights from `model_path`.

Behavior:

* For PEFT (non-init): rank 0 reads `adapter_model.safetensors`, then broadcasts.
* Otherwise: use DCP with a Hugging Face or default storage reader to populate the state dict.
* If the model exposes a `state_dict_adapter`, convert to/from HF format as needed.
* For models requiring tensor merging (e.g., Mixtral), uses transformers' conversion mapping.

**Parameters:**

* `model`: Model or parallelized model parts to load into.
* `model_path`: Path to the model checkpoint directory or HF snapshot.
* `is_init_step`: If True, treat load as initialization from a base checkpoint.
* `use_checkpoint_id`: Pass `checkpoint_id` to DCP if True; disable when using direct HF paths.
* `key_mapping`: Optional key remapping when reading from HF checkpoints.
* `allow_checkpoint_key_subset`: If True, keep the model's current initialization for parameters that are absent from the checkpoint instead of requiring an exact key match.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_on_dp_ranks(
    state: typing.Any,
    state_name: str,
    path: str
) -> None
```

Load the stateful object.

This helper is currently used to load the dataloader and RNG state.

**Parameters:**

* `state`: Stateful object to load.
* `state_name`: Name of the stateful object.
* `path`: Path to load the stateful object from.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_optimizer(
    optimizer: torch.optim.Optimizer,
    model: torch.nn.Module,
    weights_path: str,
    scheduler: typing.Optional[typing.Any] = None
) -> None
```

Load optimizer (and optional scheduler) state from `weights_path/optim` using DCP.

**Parameters:**

* `optimizer`: Optimizer to populate.
* `model`: Model providing partitioning context for the optimizer wrapper.
* `weights_path`: Base directory for checkpoints.
* `scheduler`: Optional LR scheduler to populate.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.maybe_wait_for_staging() -> None
```

Wait for the staging to finish if it is enabled.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_distributed_state(
    state: typing.Any,
    state_name: str,
    path: str
) -> None
```

Save a custom stateful object through DCP on all ranks.

This is intended for auxiliary objects whose state dict contains
sharded tensors, for example BAGEL EMA shadows under FSDP2. Rank-0
`torch.save` would only persist rank 0's local shard; DCP sees the
DTensor metadata and writes all shards correctly.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_model(
    model: torch.nn.Module,
    weights_path: str,
    peft_config: typing.Optional[peft.PeftConfig] = None,
    tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    is_final_checkpoint: bool = False
) -> None
```

Save model weights to `weights_path/model`.

Behavior:

* PEFT: write `adapter_model.safetensors` and metadata on rank 0.
* Safetensors + consolidation: emit HF artifacts under
  `weights_path/model/consolidated` and build a consolidated index.
* Otherwise: use DCP with a Hugging Face or default storage writer to save shards.

**Parameters:**

* `model`: Model to checkpoint.
* `weights_path`: Base directory for checkpoints.
* `peft_config`: Optional PEFT configuration when saving adapters.
* `tokenizer`: Optional tokenizer to save with consolidated artifacts.
* `is_final_checkpoint`: Whether this save is the final scheduled training checkpoint.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_on_dp_ranks(
    state: typing.Any,
    state_name: str,
    path: str
) -> None
```

Save the stateful object.

This helper is currently used to save the dataloader and RNG state.

**Parameters:**

* `state`: Stateful object to save.
* `state_name`: Name of the stateful object.
* `path`: Path to save the stateful object to.

```python
nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_optimizer(
    optimizer: torch.optim.Optimizer,
    model: torch.nn.Module,
    weights_path: str,
    scheduler: typing.Optional[typing.Any] = None
) -> None
```

Save optimizer (and optional scheduler) state to `weights_path/optim` using DCP.

**Parameters:**

* `optimizer`: Optimizer whose state will be saved.
* `model`: Model providing partitioning context for the optimizer wrapper.
* `weights_path`: Base directory for checkpoints.
* `scheduler`: Optional LR scheduler to include.

```python
class nemo_automodel.components.checkpoint.checkpointing._AsyncSaveContext(
    stager: typing.Any | None,
    process_group: typing.Any | None,
    future: typing.Any | None,
    staging_active: bool = False
)
```

Dataclass

Internal container for async checkpointing state.

One instance is maintained for the model save and one for the optimizer save
to keep staging/upload futures and the associated process group and stager
together in a single place.

```python
nemo_automodel.components.checkpoint.checkpointing._adapter_path(
    checkpoint_dir: str
) -> str
```

Return the PEFT adapter safetensors path inside a checkpoint dir (local or `msc://`).
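
As a rough illustration of the local versus `msc://` distinction, the path could be built as below. This is a sketch, not the library's code: `adapter_path` is a hypothetical stand-in, and the only detail taken from this page is the `adapter_model.safetensors` filename.

```python
import os

ADAPTER_FILENAME = "adapter_model.safetensors"


def adapter_path(checkpoint_dir: str) -> str:
    """Hypothetical stand-in for `_adapter_path`."""
    if checkpoint_dir.startswith("msc://"):
        # Cloud URLs always use forward slashes, regardless of the host OS.
        return checkpoint_dir.rstrip("/") + "/" + ADAPTER_FILENAME
    # Local paths use the platform's separator.
    return os.path.join(checkpoint_dir, ADAPTER_FILENAME)
```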

```python
nemo_automodel.components.checkpoint.checkpointing._apply(
    module,
    fn,
    recurse = True
) -> torch.nn.Module
```

Apply a transformation function to parameters (and gradients) only.

Mirrors `nn.Module.to_empty` for parameters while skipping buffers. Respects
future flags controlling in-place vs swap behavior and safely handles
wrapper subclasses.

**Parameters:**

* `module`: Module whose parameters are to be transformed.
* `fn`: Callable applied to each parameter (and its gradient).
* `recurse`: Whether to recurse into child modules.

**Returns:** `nn.Module`

The same module instance after transformation.

```python
nemo_automodel.components.checkpoint.checkpointing._apply_key_mapping(
    state_dict: dict[str, torch.Tensor],
    key_mapping: dict[str, str]
) -> dict[str, torch.Tensor]
```

Rename state-dict keys using regex-based `key_mapping`.

This mirrors the renaming logic used by the DCP / HuggingFace storage
reader but operates directly on an in-memory state dict.  It is needed
when loading safetensors checkpoints outside of DCP so that HF checkpoint
keys (e.g. `language_model.model.X`) are translated to the model's
parameter FQNs (e.g. `model.language_model.X`).

**Parameters:**

* `state_dict`: Original state dict whose keys may need renaming.
* `key_mapping`: `{regex_pattern: replacement}` pairs applied in order.

**Returns:** `dict[str, torch.Tensor]`

A new dict with renamed keys.
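
A minimal sketch of regex-based key renaming, using only the standard library. The function name `apply_key_mapping` and the sample keys are illustrative, and applying every pattern in sequence to each key (rather than stopping at the first match) is an assumption about the behavior.

```python
import re


def apply_key_mapping(state_dict: dict, key_mapping: dict) -> dict:
    """Illustrative sketch: return a new dict whose keys have been renamed."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        # Patterns are applied in the mapping's insertion order (assumption:
        # every pattern is tried, not just the first one that matches).
        for pattern, replacement in key_mapping.items():
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = value
    return renamed


# Translate an HF-style prefix to the model's parameter FQN prefix.
mapping = {r"^language_model\.model\.": "model.language_model."}
weights = {"language_model.model.layers.0.weight": 1, "lm_head.weight": 2}
renamed_weights = apply_key_mapping(weights, mapping)
```

Keys that match no pattern, such as `lm_head.weight` here, pass through unchanged, and the input dict is left untouched.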

```python
nemo_automodel.components.checkpoint.checkpointing._convert_checkpoint_with_transformers(
    model: torch.nn.Module,
    model_path: str,
    key_mapping: typing.Optional[dict[str, str]] = None
) -> typing.Optional[dict[str, torch.Tensor]]
```

Convert a checkpoint using transformers' conversion mapping for models that need tensor merging.

This handles MoE models like Mixtral where the checkpoint has individual expert weights
but the model uses grouped expert tensors. The transformers library's WeightConverter
operations handle the tensor merging (MergeModulelist, Concatenate).

This function converts the state dict WITHOUT loading it into the model, so it can be
used with FSDP-aware loading mechanisms.

**Parameters:**

* `model`: The model (used to get conversion mapping and target keys).
* `model_path`: Path to the HuggingFace checkpoint directory.
* `key_mapping`: Optional additional key mapping.

**Returns:** `Optional[dict[str, torch.Tensor]]`

Converted state dict ready for loading, or None if conversion failed.

```python
nemo_automodel.components.checkpoint.checkpointing._divide_keys_by_size(
    keys: list[str],
    state_dict: dict[str, torch.Tensor],
    target_shard_bytes: int
) -> dict[str, int]
```

Assign keys to deterministic size-based shards.
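
One way to picture deterministic size-based sharding is a greedy pass over the keys in their given order. The sketch below is an assumption about the general technique, not the actual algorithm: `divide_keys_by_size` takes a hypothetical `sizes` mapping of byte counts instead of tensors, and the 1-based shard numbering is also assumed.

```python
def divide_keys_by_size(keys, sizes, target_shard_bytes):
    """Hypothetical greedy sketch: fill a shard until the next key would overflow it."""
    mapping = {}
    shard = 1  # assumed 1-based, like HF shard file numbering
    used = 0
    for key in keys:
        size = sizes[key]
        # Start a new shard only if the current one already holds something;
        # an oversized tensor still gets a shard of its own.
        if used > 0 and used + size > target_shard_bytes:
            shard += 1
            used = 0
        mapping[key] = shard
        used += size
    return mapping
```

Because the keys are visited in a fixed order and no randomness is involved, the same inputs always yield the same assignment.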

```python
nemo_automodel.components.checkpoint.checkpointing._ensure_dirs(
    dirs: typing.Optional[str] = ()
) -> None
```

Create directories on all ranks and synchronize across ranks.

**Parameters:**

* `dirs`: One or more directory paths that should exist.

```python
nemo_automodel.components.checkpoint.checkpointing._ensure_msc_available() -> None
```

Raise an error if MSC is not installed but a cloud path is used.

```python
nemo_automodel.components.checkpoint.checkpointing._equally_divide_layers(
    num_shards: int,
    keys: list[str]
) -> dict[str, int]
```

Equally divide the state dict keys into num\_shards shards.
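
A sketch of an even split, assuming contiguous runs of keys, 1-based shard indices, and that any remainder goes to the earliest shards. The name `equally_divide_keys` is illustrative, and the real function may distribute the remainder differently.

```python
def equally_divide_keys(num_shards, keys):
    """Illustrative sketch: assign contiguous runs of keys to num_shards shards."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    per_shard, remainder = divmod(len(keys), num_shards)
    mapping = {}
    start = 0
    for shard in range(1, num_shards + 1):
        # The first `remainder` shards take one extra key.
        count = per_shard + (1 if shard <= remainder else 0)
        for key in keys[start:start + count]:
            mapping[key] = shard
        start += count
    return mapping
```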

```python
nemo_automodel.components.checkpoint.checkpointing._get_checkpoint_metadata_keys(
    path: str,
    storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader] = None
) -> set[str]
```

Return checkpoint FQNs present in metadata.

```python
nemo_automodel.components.checkpoint.checkpointing._get_hf_safetensors_reference_path(
    cache_dir: str | pathlib.Path | None,
    repo_id: str | None
) -> str | None
```

Return the local HF safetensors reference directory for a model.

Prefer the snapshot directory containing `model.safetensors.index.json` for
sharded checkpoints. If no index exists but a snapshot directory is present,
return that directory as the single-file safetensors reference path. Return
None when `repo_id` is None or the repo has no cached snapshot directory.

For example, if the located file is
`/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/13afe.../model.safetensors.index.json`,
this function returns the directory path
`/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/13afe...`.

This will error if the model hasn't been downloaded or if the cache directory is incorrect.

**Parameters:**

* `cache_dir`: Path to cache directory.
* `repo_id`: Hugging Face repository ID.

**Returns:** `str | None`

Path to the snapshot/model directory containing safetensors weights, or None when `repo_id` is None or no cached snapshot directory exists.
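
The lookup can be approximated with `pathlib` against the standard Hugging Face cache layout (`models--<org>--<name>/snapshots/<revision>/`). This is a sketch only: `find_safetensors_reference_dir` is a hypothetical name, it returns None for a missing `cache_dir` (the real function may instead fall back to a default cache), and picking the first snapshot in sorted order is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_safetensors_reference_dir(cache_dir, repo_id) -> Optional[str]:
    """Hypothetical sketch: locate a cached snapshot directory for repo_id."""
    if cache_dir is None or repo_id is None:
        return None
    repo_dir = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    # Prefer a snapshot that carries a sharded-checkpoint index.
    for snapshot in candidates:
        if (snapshot / "model.safetensors.index.json").is_file():
            return str(snapshot)
    # Otherwise treat the first snapshot as a single-file reference.
    return str(candidates[0]) if candidates else None
```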

```python
nemo_automodel.components.checkpoint.checkpointing._get_original_hf_index_total_size(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> int | None
```

Return the original HF safetensors index total size, if available.

```python
nemo_automodel.components.checkpoint.checkpointing._init_peft_adapters(
    model: torch.nn.Module,
    peft_init_method: str
) -> None
```

Initialize the PEFT adapters with the scaled weights.

**Parameters:**

* `model`: Model to initialize PEFT adapters for.
* `peft_init_method`: Method to initialize PEFT adapters, e.g. "xavier". See `LinearLoRA` for more details.

```python
nemo_automodel.components.checkpoint.checkpointing._is_bin_checkpoint(
    path: str
) -> bool
```

Return True if path looks like a PyTorch .bin checkpoint.

```python
nemo_automodel.components.checkpoint.checkpointing._is_custom_model(
    module: torch.nn.Module
) -> bool
```

True if the model has a custom implementation in nemo\_automodel/components/models/.

The generic HFCheckpointingMixin (in .common.hf\_checkpointing\_mixin) is
injected into every model by \_get\_mixin\_wrapped\_class and does NOT count
as a "custom model".  Only actual model implementations (e.g. llama,
deepseek\_v3) that live under nemo\_automodel.components.models qualify.

```python
nemo_automodel.components.checkpoint.checkpointing._is_remote_code_model(
    module: torch.nn.Module
) -> bool
```

True if the model was loaded with trust\_remote\_code (HF dynamic modules).

```python
nemo_automodel.components.checkpoint.checkpointing._is_safetensors_checkpoint(
    path: str
) -> bool
```

Return True if path looks like a safetensors checkpoint (so we can preserve dtype); else DCP or other.
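
A path-based heuristic of this kind could be written as follows. The exact checks the library performs are not documented here, so treat `looks_like_safetensors` and its rules (a `.safetensors` file, or a directory holding one or a safetensors index) as assumptions.

```python
import os


def looks_like_safetensors(path: str) -> bool:
    """Hypothetical heuristic: detect a safetensors file or directory."""
    if os.path.isfile(path):
        return path.endswith(".safetensors")
    if os.path.isdir(path):
        # A directory counts if it holds any safetensors shard or index file.
        return any(
            name.endswith(".safetensors") or name.endswith(".safetensors.index.json")
            for name in os.listdir(path)
        )
    return False
```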

```python
nemo_automodel.components.checkpoint.checkpointing._load_full_state_dict_into_model(
    model_parts: list[torch.nn.Module],
    state_dict: dict[str, torch.Tensor]
) -> None
```

Load a full (non-sharded) state dict into a potentially FSDP-wrapped model.

Every rank must supply the **full** state dict.  PyTorch's
`set_model_state_dict` with `full_state_dict=True` (but **not**
`broadcast_from_rank0`) calls `_distribute_state_dict` which lets
each rank independently slice its local DTensor shard from the full
tensor -- no NCCL collectives are needed.

We intentionally avoid `broadcast_from_rank0=True` because it
introduces an asymmetric workload: rank 0 does a synchronous CPU→GPU
copy (`.to(device)`) per tensor while other ranks only do
`torch.empty` (async allocation).  The non-src ranks race ahead
enqueuing hundreds of NCCL broadcasts that rank 0 cannot keep up with,
leading to a 60 s NCCL watchdog timeout.

After loading, floating-point parameters are converted to match the
checkpoint dtype.  PyTorch's `set_model_state_dict` uses *copy*
semantics (`assign=False`) for non-meta parameters, which preserves
the model's initialisation dtype instead of the checkpoint dtype.
The post-load fixup ensures the safetensors dtype (e.g. bf16) is
honoured.

**Parameters:**

`model_parts`: List of model parts (for pipeline parallelism).

`state_dict`: Full state dict with regular tensors.  Must be
populated on **every** rank (not just rank 0).
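The reason no collectives are needed can be illustrated without any distributed setup. The sketch below is an assumption-laden stand-in: a plain Python list plays the role of the full tensor, `local_shard` is a hypothetical helper, and the dim-0 ceil-division chunking is assumed for illustration (the real layout is decided by each parameter's DTensor placement). It only shows that a rank holding the full copy can compute its own shard from its rank index alone.

```python
def local_shard(full: list, rank: int, world_size: int) -> list:
    """Slice this rank's dim-0 shard out of a full copy (a list stands in for a tensor)."""
    # Ceil-division chunk size, so trailing ranks may get a shorter (or empty) shard.
    chunk = -(-len(full) // world_size)
    return full[rank * chunk : (rank + 1) * chunk]


# Every rank holds this identical full copy, so each slice is purely local work.
full_weight = list(range(10))
shards = [local_shard(full_weight, rank, world_size=4) for rank in range(4)]
```

Because each slice depends only on `rank` and `world_size`, no rank has to wait for data from another, which is the property the function relies on to avoid the broadcast race described above.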

```python
nemo_automodel.components.checkpoint.checkpointing._load_hf_bin_checkpoint(
    model_path: str,
    weights_only: bool = True
) -> typing.Optional[dict[str, torch.Tensor]]
```

Load a HuggingFace .bin checkpoint into a state dict.

Handles single-file (pytorch\_model.bin), sharded (pytorch\_model.bin.index.json),
and glob fallback (\*.bin) layouts.
Returns None if no .bin files are found.

**Parameters:**

`model_path`: Path to checkpoint file or directory.

`weights_only`: Passed to `torch.load`.  Default `True` for safety;
set to `False` for remote-code models whose checkpoints may
contain custom pickled objects.
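The three layouts can be sketched as a file-resolution step. This is a minimal illustration, not the module's implementation: `resolve_bin_files` is a hypothetical helper, the lookup order is assumed to follow the order in which the layouts are listed above, and the actual loading with `torch.load` is left out so the sketch runs with the standard library alone. The `weight_map` key is the standard field of a Hugging Face sharded index file.

```python
import glob
import json
import os


def resolve_bin_files(model_path: str) -> list[str]:
    """Return the .bin files to load for `model_path`, or [] if none are found."""
    # A direct path to a single .bin file.
    if os.path.isfile(model_path):
        return [model_path] if model_path.endswith(".bin") else []

    # 1. Single-file layout.
    single = os.path.join(model_path, "pytorch_model.bin")
    if os.path.isfile(single):
        return [single]

    # 2. Sharded layout: the index maps each weight name to its shard file.
    index = os.path.join(model_path, "pytorch_model.bin.index.json")
    if os.path.isfile(index):
        with open(index) as f:
            weight_map = json.load(f)["weight_map"]
        return [os.path.join(model_path, name) for name in sorted(set(weight_map.values()))]

    # 3. Glob fallback: any *.bin file in the directory.
    return sorted(glob.glob(os.path.join(model_path, "*.bin")))
```

An empty result corresponds to the documented `None` return when no .bin files are found.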

```python
nemo_automodel.components.checkpoint.checkpointing._load_hf_checkpoint_preserving_dtype(
    model_path: str,
    weights_only: bool = True
) -> typing.Optional[dict[str, torch.Tensor]]
```

Load a HuggingFace checkpoint into a new state dict so that tensor dtypes
match the checkpoint (e.g. bf16). Used when loading the base model so that
FSDP sees a uniform dtype instead of the model's init dtypes (e.g. float32).
Prefers safetensors and falls back to .bin files.
Returns None if no loadable checkpoint is found.

**Parameters:**

`model_path`: Path to checkpoint file or directory.

`weights_only`: Forwarded to `torch.load` when loading `.bin` files.

```python
nemo_automodel.components.checkpoint.checkpointing._load_hf_safetensors_checkpoint(
    model_path: str
) -> typing.Optional[dict[str, torch.Tensor]]
```

Load a safetensors checkpoint into a state dict.

```python
nemo_automodel.components.checkpoint.checkpointing._load_safetensors(
    path: str
) -> dict[str, torch.Tensor]
```

Read a safetensors file from a local path or an `msc://` cloud path.

```python
nemo_automodel.components.checkpoint.checkpointing._materialize_to_hf_views_for_save(
    state_dict: dict[str, torch.Tensor]
) -> None
```

Replace non-contiguous tensor values in `state_dict` with contiguous copies in place.

MoE adapters return non-contiguous strided views into the model's grouped
expert storage for the optimized load path; `safetensors.torch.save`
(which the DCP HF storage writer calls) rejects non-contiguous tensors,
so we materialize one tensor at a time here, calling `empty_cache` between
iterations. This bounds the transient allocation to a single expert weight
instead of allocating the full grouped set up front.

```python
nemo_automodel.components.checkpoint.checkpointing._maybe_adapt_state_dict_from_hf(
    model_part: torch.nn.Module,
    state_dict: dict[str, torch.Tensor],
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
) -> dict[str, torch.Tensor]
```

Custom models use state dict adapters to convert the state dict from the Hugging Face format to the native format.

```python
nemo_automodel.components.checkpoint.checkpointing._maybe_adapt_state_dict_to_hf(
    model_part: torch.nn.Module,
    state_dict: dict[str, torch.Tensor],
    quantization: bool = False,
    kwargs = {}
) -> dict[str, torch.Tensor]
```

Custom models use state dict adapters to convert the state dict to the Hugging Face format.

```python
nemo_automodel.components.checkpoint.checkpointing._maybe_msc_reader(
    path: str,
    storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]
```

Return an MSC filesystem reader for `msc://` paths, else the given reader.

```python
nemo_automodel.components.checkpoint.checkpointing._maybe_msc_writer(
    path: str,
    storage_writer: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]
```

Return an MSC filesystem writer for `msc://` paths, else the given writer.

```python
nemo_automodel.components.checkpoint.checkpointing._model_has_dtensors(
    module: torch.nn.Module
) -> bool
```

True if any parameter is a DTensor (model is already sharded).

```python
nemo_automodel.components.checkpoint.checkpointing._normalize_dtype_mapping_to_state_dict_keys(
    fqn_to_dtype_mapping: dict[str, str],
    state_dict_keys: list[str],
    base_model_prefix: str | None = None
) -> dict[str, str]
```

Align original HF dtype metadata with the keys that will be exported.

```python
nemo_automodel.components.checkpoint.checkpointing._reinit_non_persistent_buffers(
    model: torch.nn.Module,
    device: torch.device,
    model_type: str | None = None
) -> None
```

Recompute non-persistent buffers that are not saved in checkpoints.

Non-persistent buffers are not saved in checkpoints, so after meta-device
materialization they contain uninitialized CUDA memory.  When
`initialize_weights()` is skipped (e.g. for Gemma3 to avoid DTensor
issues), these buffers must be recomputed explicitly.

Only runs for models listed in `_MODELS_REQUIRING_BUFFER_REINIT` to
avoid unexpected side-effects on arbitrary HF Hub models.

Handles four patterns:

1. **Standard RoPE** — single `inv_freq` buffer with `rope_init_fn` +
   `rope_kwargs` (e.g. Nemotron-NAS).
2. **Per-layer-type RoPE** — `{layer_type}_inv_freq` buffers via
   `compute_default_rope_parameters` (e.g. Gemma3RotaryEmbedding).
3. **Scaled embedding** — `embed_scale` buffer on `ScaledWordEmbedding`
   modules (Gemma family), recomputed from `scalar_embed_scale`.
4. **Vision position IDs** — `position_ids` buffer on vision embedding
   modules (SigLIP), recomputed from `num_positions`.

**Parameters:**

`model`: Model to reinitialize non-persistent buffers for.

`device`: Device to create the new buffers on.

`model_type`: The `config.model_type` string.  If not in
`_MODELS_REQUIRING_BUFFER_REINIT` the function is a no-op.
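As a rough illustration of the first pattern and of the allow-list gating, the sketch below recomputes a standard RoPE inverse-frequency table in plain Python. It is a sketch under stated assumptions: `default_inv_freq` and `maybe_recompute_inv_freq` are hypothetical names, the formula is the common default RoPE definition rather than whatever `rope_init_fn` a given model supplies, and lists stand in for tensors. Only the allow-list contents are taken from this module.

```python
# Same contents as the module-level constant documented on this page.
_MODELS_REQUIRING_BUFFER_REINIT = frozenset({"gemma3", "nemotron-nas"})


def default_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Default RoPE inverse frequencies: base ** (-k / head_dim) for k = 0, 2, 4, ..."""
    return [base ** (-k / head_dim) for k in range(0, head_dim, 2)]


def maybe_recompute_inv_freq(model_type, head_dim: int, base: float = 10000.0):
    """Return a fresh inv_freq table, or None when the model type is not allow-listed."""
    # Mirrors the documented no-op for model types outside the allow-list.
    if model_type not in _MODELS_REQUIRING_BUFFER_REINIT:
        return None
    return default_inv_freq(head_dim, base)
```

The point of recomputing rather than loading is that such buffers are derived entirely from configuration values, so nothing from the checkpoint is required.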

```python
nemo_automodel.components.checkpoint.checkpointing._save_safetensors(
    state_dict: dict[str, torch.Tensor],
    path: str
) -> None
```

Write a safetensors file to a local path or an `msc://` cloud path.

For cloud paths the tensors are serialized to bytes and streamed to the MSC
file handle, since `save_file` only accepts a local filesystem path.

```python
nemo_automodel.components.checkpoint.checkpointing._should_write_consolidated_safetensors(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
    is_final_checkpoint: bool = False
) -> bool
```

Whether to output consolidated HF weights along with sharded weights.

```python
nemo_automodel.components.checkpoint.checkpointing._should_write_hf_metadata(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> bool
```

Whether to write HF metadata/artifacts for a checkpoint.

```python
nemo_automodel.components.checkpoint.checkpointing._summarize_state_dict_key_diff(
    expected_keys: set[str],
    loaded_keys: set[str],
    limit: int = 10
) -> dict[str, typing.Any]
```

Summarize state-dict key mismatches for checkpoint load diagnostics.
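A key-diff summary of this kind can be sketched with plain set operations. The helper name `summarize_key_diff` and the field names in the returned dict are assumptions made for illustration; the actual structure of the returned summary is not documented here. The `limit` argument is assumed to cap how many example keys are reported.

```python
def summarize_key_diff(expected_keys: set[str], loaded_keys: set[str], limit: int = 10) -> dict:
    """Count missing and unexpected keys and keep a short, sorted sample of each."""
    missing = sorted(expected_keys - loaded_keys)      # model expects, checkpoint lacks
    unexpected = sorted(loaded_keys - expected_keys)   # checkpoint has, model does not expect
    return {
        "num_missing": len(missing),
        "num_unexpected": len(unexpected),
        "missing_sample": missing[:limit],
        "unexpected_sample": unexpected[:limit],
    }
```

Capping the samples keeps log output readable when a prefix mismatch causes every key to differ.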

```python
nemo_automodel.components.checkpoint.checkpointing._warn_if_inline_consolidation_enabled(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> None
```

Educate users about the cost of inline HF consolidation.

```python
nemo_automodel.components.checkpoint.checkpointing._warn_if_large_inline_consolidation(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
    state_dict: dict[str, torch.Tensor],
    fqn_to_index_mapping: typing.Optional[dict[str, int]],
    is_final_checkpoint: bool = False
) -> None
```

Warn when inline consolidated export is large enough to waste GPU allocation time.

```python
nemo_automodel.components.checkpoint.checkpointing.is_cloud_path(
    path: str
) -> bool
```

Check if path is a cloud storage path (MSC).

```python
nemo_automodel.components.checkpoint.checkpointing.save_config(
    config: dict[str, typing.Any],
    weights_path: str
) -> None
```

Save a config to a weights path.

**Parameters:**

`config`: Config to save.

`weights_path`: Path to save the config to.

```python
nemo_automodel.components.checkpoint.checkpointing.to_empty_parameters_only(
    model: torch.nn.Module,
    device: torch.device,
    recurse: bool = True,
    dtype: torch.dtype | None = None
) -> torch.nn.Module
```

Move parameters to the specified device without copying storage, skipping buffers.

Mirrors torch.nn.Module.to\_empty but applies only to parameters, not buffers.

**Parameters:**

`model`: The module to transform.

`device`: Target device.

`recurse`: Whether to recurse into child modules.

**Returns:** `nn.Module`

The same module instance

```python
nemo_automodel.components.checkpoint.checkpointing.MSC_AVAILABLE = True
```

```python
nemo_automodel.components.checkpoint.checkpointing._CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES = 50 * 1024 ** 3
```

```python
nemo_automodel.components.checkpoint.checkpointing._DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES = 5 * 1024 ** 3
```
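For a sense of scale, the two size constants above work out to 50 GiB and 5 GiB. The sketch below does the arithmetic for two hypothetical export sizes. `estimate_export` is an illustrative helper, not part of the module; it assumes the warning fires when the total strictly exceeds the threshold and that shards are filled to capacity, so the shard count is a lower bound (real shards are cut at tensor boundaries).

```python
GIB = 1024 ** 3
WARNING_THRESHOLD_BYTES = 50 * GIB   # 53,687,091,200 bytes
DEFAULT_SHARD_SIZE_BYTES = 5 * GIB   # 5,368,709,120 bytes


def estimate_export(total_bytes: int) -> tuple[int, bool]:
    """Return (minimum shard count, whether the size exceeds the warning threshold)."""
    num_shards = -(-total_bytes // DEFAULT_SHARD_SIZE_BYTES)  # ceil division
    return num_shards, total_bytes > WARNING_THRESHOLD_BYTES


# Roughly 8B parameters in bf16 (2 bytes each): 16e9 bytes.
small = estimate_export(16_000_000_000)
# Roughly 70B parameters in bf16: 140e9 bytes.
large = estimate_export(140_000_000_000)
```

Under these assumptions the smaller export fits in 3 shards and stays below the threshold, while the larger one needs at least 27 shards and exceeds it.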

```python
nemo_automodel.components.checkpoint.checkpointing._MODELS_REQUIRING_BUFFER_REINIT: frozenset[str] = frozenset({'gemma3', 'nemotron-nas'})
```

```python
nemo_automodel.components.checkpoint.checkpointing.logger = logging.getLogger(__name__)
```