nemo_automodel.components.checkpoint.checkpointing

Module Contents

Classes

Name	Description
`Checkpointer`	High-level checkpoint manager built on torch.distributed.checkpoint (DCP).
`_AsyncSaveContext`	Internal container for async checkpointing state.

Functions

Name	Description
`_adapter_path`	Return the PEFT adapter safetensors path inside a checkpoint dir (local or `msc://`).
`_apply`	Apply a transformation function to parameters (and gradients) only.
`_apply_key_mapping`	Rename state-dict keys using regex-based `key_mapping`.
`_convert_checkpoint_with_transformers`	Convert a checkpoint using transformers’ conversion mapping for models that need tensor merging.
`_divide_keys_by_size`	Assign keys to deterministic size-based shards.
`_ensure_dirs`	Create directories on all ranks and synchronize across ranks.
`_ensure_msc_available`	Raise an error if MSC is not installed but a cloud path is used.
`_equally_divide_layers`	Equally divide the state dict keys into num_shards shards.
`_get_checkpoint_metadata_keys`	Return checkpoint FQNs present in metadata.
`_get_hf_safetensors_reference_path`	Return the local HF safetensors reference directory for a model.
`_get_original_hf_index_total_size`	Return the original HF safetensors index total size, if available.
`_init_peft_adapters`	Initialize the PEFT adapters with the scaled weights.
`_is_bin_checkpoint`	Return True if path looks like a PyTorch .bin checkpoint.
`_is_custom_model`	True if the model has a custom implementation in nemo_automodel/components/models/.
`_is_remote_code_model`	True if the model was loaded with trust_remote_code (HF dynamic modules).
`_is_safetensors_checkpoint`	Return True if path looks like a safetensors checkpoint (so we can preserve dtype); else DCP or other.
`_load_full_state_dict_into_model`	Load a full (non-sharded) state dict into a potentially FSDP-wrapped model.
`_load_hf_bin_checkpoint`	Load a HuggingFace .bin checkpoint into a state dict.
`_load_hf_checkpoint_preserving_dtype`	Load a HuggingFace checkpoint into a new state dict so tensor dtypes match the checkpoint (e.g. bf16).
`_load_hf_safetensors_checkpoint`	Load a safetensors checkpoint into a state dict.
`_load_safetensors`	Read a safetensors file from a local path or an `msc://` cloud path.
`_materialize_to_hf_views_for_save`	Replace non-contiguous tensor values in `state_dict` with contiguous copies in place.
`_maybe_adapt_state_dict_from_hf`	Custom models use state dict adapters to convert the state dict from the Hugging Face format to the native format.
`_maybe_adapt_state_dict_to_hf`	Custom models use state dict adapters to convert the state dict to the Hugging Face format.
`_maybe_msc_reader`	Return an MSC filesystem reader for `msc://` paths, else the given reader.
`_maybe_msc_writer`	Return an MSC filesystem writer for `msc://` paths, else the given writer.
`_model_has_dtensors`	True if any parameter is a DTensor (model is already sharded).
`_normalize_dtype_mapping_to_state_dict_keys`	Align original HF dtype metadata with the keys that will be exported.
`_reinit_non_persistent_buffers`	Recompute non-persistent buffers that are not saved in checkpoints.
`_save_safetensors`	Write a safetensors file to a local path or an `msc://` cloud path.
`_should_write_consolidated_safetensors`	Whether to output consolidated HF weights along with sharded weights.
`_should_write_hf_metadata`	Whether to write HF metadata/artifacts for a checkpoint.
`_summarize_state_dict_key_diff`	Summarize state-dict key mismatches for checkpoint load diagnostics.
`_warn_if_inline_consolidation_enabled`	Educate users about the cost of inline HF consolidation.
`_warn_if_large_inline_consolidation`	Warn when inline consolidated export is large enough to waste GPU allocation time.
`is_cloud_path`	Check if path is a cloud storage path (MSC).
`save_config`	Save a config to a weights path.
`to_empty_parameters_only`	Move parameters to the specified device without copying storage, skipping buffers.

Data

MSC_AVAILABLE

_CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES

_DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES

_MODELS_REQUIRING_BUFFER_REINIT

logger

API

class nemo_automodel.components.checkpoint.checkpointing.Checkpointer(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
    dp_rank: int,
    tp_rank: int,
    pp_rank: int,
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
)

High-level checkpoint manager built on torch.distributed.checkpoint (DCP).

Supports:

HF sharded safetensors via custom storage reader/writer
Optional consolidated export (config, generation config, tokenizer)
PEFT adapter save/load handling
Async save for torch >= 2.9.0

Also provides DP-aware helpers for saving/loading auxiliary state and utilities to initialize from a base HF checkpoint.

_addons = []

_model_ctx

_optim_ctx

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._do_load(
    state_dict: dict[str, torch.Tensor],
    path: str,
    storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader] = None,
    is_init_step: bool = False
) -> dict[str, torch.Tensor]

Load a state dictionary from path using DCP or PEFT special-case logic.

Parameters:

state_dict

dict[str, torch.Tensor]

Mutable state dict to populate with tensors.

path

str

Checkpoint directory path.

storage_reader

Optional[_HuggingFaceStorageReader], defaults to None

Optional HF storage reader for safetensors.

is_init_step

bool, defaults to False

True if loading from a base checkpoint during initialization.

Returns: dict[str, torch.Tensor]

The populated state dictionary (may be replaced for PEFT).

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._do_save(
    state_dict: dict[str, torch.Tensor],
    path: str,
    storage_writer: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter] = None
) -> typing.Optional[torch.distributed.checkpoint.state_dict_saver.AsyncSaveResponse]

Save a state dictionary to path using DCP or PEFT special-case logic.

For PEFT model saves: only rank 0 writes adapter_model.safetensors.
If async mode is enabled, schedule an asynchronous save.

Parameters:

state_dict

dict[str, torch.Tensor]

State dict to be serialized.

path

str

Checkpoint directory path.

storage_writer

Optional[_HuggingFaceStorageWriter], defaults to None

Optional HF storage writer for safetensors sharding.

Returns: Optional[AsyncSaveResponse]

An AsyncSaveResponse if async mode is enabled; otherwise None.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_original_model_path(
    model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState
) -> str | None

Get the path to the original model from the Hugging Face checkpoint.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_storage_reader(
    model_path: str,
    key_mapping: typing.Optional[dict[str, str]],
    is_init_step: bool = False,
    is_safetensors: bool | None = None
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]

Construct a Hugging Face storage reader when loading safetensors or during init.

Prefers the upstream torch.distributed.checkpoint.hf_storage.HuggingFaceStorageReader when no key_mapping is needed, since it uses safetensors’ native get_slice() for efficient partial reads (only the bytes for the local DTensor shard are read from disk). Falls back to the backported reader when key_mapping is required or when the upstream reader is not available.

Parameters:

model_path

str

Path to the model checkpoint directory or HF snapshot.

key_mapping

Optional[dict[str, str]]

Optional key remapping for conversion.

is_init_step

bool, defaults to False

If True, always produce a reader for base HF load.

is_safetensors

bool | None, defaults to None

Whether model_path holds a safetensors checkpoint; computed from the directory contents when not supplied.

Returns: Optional[_HuggingFaceStorageReader]

Configured storage reader, or None for the default DCP FileSystemReader.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_storage_writer(
    consolidated_output_path: typing.Optional[str],
    fqn_to_index_mapping: typing.Optional[dict[str, int]],
    fqn_to_dtype_mapping: typing.Optional[dict[str, str]],
    model_path: str,
    consolidate_on_all_ranks: bool = False
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]

Construct a Hugging Face storage writer for sharded safetensors.

Parameters:

consolidated_output_path

Optional[str]

Optional path for consolidated artifacts.

fqn_to_index_mapping

Optional[dict[str, int]]

Optional mapping from FQN to shard index.

fqn_to_dtype_mapping

Optional[dict[str, str]]

Optional mapping from FQN to original HF safetensors dtype string.

model_path

str

Path where the model checkpoint is saved.

consolidate_on_all_ranks

bool, defaults to False

If True, consolidate on all ranks on the main process.

Returns: Optional[_HuggingFaceStorageWriter]

Configured _HuggingFaceStorageWriter or None for non-safetensors.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_build_consolidated_index(
    model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState,
    state_dict: dict[str, torch.Tensor]
) -> typing.Optional[dict[str, int]]

Build FQN to shard index mapping for consolidated HF export.

Uses the base checkpoint index (if present), removes non-persistent keys, and assigns new keys to the last shard by default.

Parameters:

model_state

ModelState

Wrapper exposing the primary model part.

state_dict

dict[str, torch.Tensor]

The state dict that will be saved.

Returns: Optional[dict[str, int]]

Mapping from FQN to shard index, or None when not consolidating.
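
The rule described above (reuse the base checkpoint's shard index, drop keys that will not be saved, send new keys to the last shard) can be sketched with plain dictionaries. This is an illustrative sketch only, not the library's implementation: `build_consolidated_index` and both of its arguments are hypothetical names, and returning None for an empty base index is an assumption about the "not consolidating" case.

```python
def build_consolidated_index(base_index, keys_to_save):
    """Map each key that will be saved to a shard index.

    base_index: hypothetical {fqn: shard_index} read from the base checkpoint.
    keys_to_save: FQNs present in the state dict being written.
    """
    if not base_index:
        return None  # assumption: nothing to reuse, so no index is built
    last_shard = max(base_index.values())
    mapping = {}
    for key in keys_to_save:
        # Known keys keep their original shard; new keys go to the last shard.
        mapping[key] = base_index.get(key, last_shard)
    return mapping


base = {"embed.weight": 1, "layers.0.weight": 1, "rope.inv_freq": 1, "lm_head.weight": 2}
saved = ["embed.weight", "layers.0.weight", "lm_head.weight", "lora_extra.weight"]
index = build_consolidated_index(base, saved)
```

Because the loop walks only the keys being saved, entries that exist in the base index but not in the state dict (here `rope.inv_freq`) drop out without a separate removal step.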

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_build_original_dtype_mapping(
    model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState,
    state_dict: dict[str, torch.Tensor]
) -> typing.Optional[dict[str, str]]

Build FQN to original HF safetensors dtype mapping for consolidated export.

Returns None when the run started from config-only weights or the original HF safetensors headers are not available. In that case consolidation keeps the saved checkpoint dtype unless the user explicitly passes CAST_DTYPE to the offline helper.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_log_final_offline_consolidation_hint(
    model_dir: str,
    is_final_checkpoint: bool = False
) -> None

Log the final-checkpoint helper hint when consolidated export was disabled.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_write_offline_consolidation_script(
    model_dir: str
) -> None

Write a conservative helper script for offline HF safetensors consolidation.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.async_wait() -> None

Wait for the async save to finish.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.close() -> None

Close the checkpointer.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.initialize_model_weights(
    model: torch.nn.Module,
    device: torch.device,
    peft_init_method: str | None = None
) -> None

staticmethod

Materialize meta-device parameters and initialize model weights.

Moves empty parameter shells to the target device, resets HF initialization flags, calls the model’s weight initialization method, and initializes any PEFT adapters.

Parameters:

model

torch.nn.Module

Model whose weights should be initialized.

device

torch.device

Target device for materialized parameters.

peft_init_method

str | None, defaults to None

Initialization method for PEFT adapters (e.g. “xavier”).

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_base_model(
    model: torch.nn.Module,
    device: torch.device,
    root_dir: str,
    model_name: str | None,
    load_base_model: bool = True
) -> None

Load a model from the base Hugging Face checkpoint in parallel.

Parameters:

model

torch.nn.Module

Model to load state into

device

torch.device

Device to load model onto

root_dir

str

Root directory of the model cache or snapshots

model_name

str | None

Name of the model or an absolute path to a snapshot

load_base_model

bool, defaults to True

If True, restore from HF base checkpoint

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_distributed_state(
    state: typing.Any,
    state_name: str,
    path: str
) -> None

Load a custom stateful object previously saved with DCP.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_model(
    model: torch.nn.Module,
    model_path: str,
    is_init_step: bool = False,
    use_checkpoint_id: bool = True,
    key_mapping: typing.Optional[dict[str, str]] = None,
    allow_checkpoint_key_subset: bool = False
) -> None

Load model weights from model_path.

Behavior:

For PEFT (non-init): rank 0 reads adapter_model.safetensors, then broadcasts.
Otherwise: use DCP with a Hugging Face or default storage reader to populate the state dict.
If the model exposes a state_dict_adapter, convert to/from HF format as needed.
For models requiring tensor merging (e.g., Mixtral), uses transformers’ conversion mapping.

Parameters:

model

nn.Module

Model or parallelized model parts to load into.

model_path

str

Path to the model checkpoint directory or HF snapshot.

is_init_step

bool, defaults to False

If True, treat load as initialization from a base checkpoint.

use_checkpoint_id

bool, defaults to True

Pass checkpoint_id to DCP if True; disable when using direct HF paths.

key_mapping

Optional[dict[str, str]], defaults to None

Optional key remapping when reading from HF checkpoints.

allow_checkpoint_key_subset

bool, defaults to False

If True, keep the model’s current initialization for parameters that are absent from the checkpoint instead of requiring an exact key match.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_on_dp_ranks(
    state: typing.Any,
    state_name: str,
    path: str
) -> None

Load the stateful object.

Helper currently used to load the dataloader and RNG state.

Parameters:

state

Any

Stateful object to load

state_name

str

Name of the stateful object

path

str

Path to load stateful object

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_optimizer(
    optimizer: torch.optim.Optimizer,
    model: torch.nn.Module,
    weights_path: str,
    scheduler: typing.Optional[typing.Any] = None
) -> None

Load optimizer (and optional scheduler) state from weights_path/optim using DCP.

Parameters:

optimizer

torch.optim.Optimizer

Optimizer to populate.

model

nn.Module

Model providing partitioning context for the optimizer wrapper.

weights_path

str

Base directory for checkpoints.

scheduler

Optional[Any], defaults to None

Optional LR scheduler to populate.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.maybe_wait_for_staging() -> None

Wait for the staging to finish if it is enabled.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_distributed_state(
    state: typing.Any,
    state_name: str,
    path: str
) -> None

Save a custom stateful object through DCP on all ranks.

This is intended for auxiliary objects whose state dict contains sharded tensors, for example BAGEL EMA shadows under FSDP2. Rank-0 torch.save would only persist rank 0’s local shard; DCP sees the DTensor metadata and writes all shards correctly.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_model(
    model: torch.nn.Module,
    weights_path: str,
    peft_config: typing.Optional[peft.PeftConfig] = None,
    tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    is_final_checkpoint: bool = False
) -> None

Save model weights to weights_path/model.

Behavior:

PEFT: write adapter_model.safetensors and metadata on rank 0.
Safetensors + consolidation: emit HF artifacts under weights_path/model/consolidated and build a consolidated index.
Otherwise: use DCP with a Hugging Face or default storage writer to save shards.

Parameters:

model

nn.Module

Model to checkpoint.

weights_path

str

Base directory for checkpoints.

peft_config

Optional[PeftConfig], defaults to None

Optional PEFT configuration when saving adapters.

tokenizer

Optional[PreTrainedTokenizerBase], defaults to None

Optional tokenizer to save with consolidated artifacts.

is_final_checkpoint

bool, defaults to False

Whether this save is the final scheduled training checkpoint.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_on_dp_ranks(
    state: typing.Any,
    state_name: str,
    path: str
) -> None

Save the stateful object.

Helper currently used to save the dataloader and RNG state.

Parameters:

state

Any

Stateful object to save

state_name

str

Name of the stateful object

path

str

Path to save stateful object

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_optimizer(
    optimizer: torch.optim.Optimizer,
    model: torch.nn.Module,
    weights_path: str,
    scheduler: typing.Optional[typing.Any] = None
) -> None

Save optimizer (and optional scheduler) state to weights_path/optim using DCP.

Parameters:

optimizer

torch.optim.Optimizer

Optimizer whose state will be saved.

model

nn.Module

Model providing partitioning context for the optimizer wrapper.

weights_path

str

Base directory for checkpoints.

scheduler

Optional[Any], defaults to None

Optional LR scheduler to include.
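
The two save methods share one base directory: `save_model` writes under weights_path/model and `save_optimizer` under weights_path/optim. A minimal sketch of calling them together follows. `save_training_state` and `RecordingCheckpointer` are hypothetical names; the stand-in class only mimics the documented method signatures so the example runs without a distributed setup, and it is not the real `Checkpointer`.

```python
def save_training_state(checkpointer, model, optimizer, weights_path, scheduler=None):
    """Save model and optimizer state through a Checkpointer-like object."""
    checkpointer.save_model(model, weights_path)
    checkpointer.save_optimizer(optimizer, model, weights_path, scheduler=scheduler)


class RecordingCheckpointer:
    """Hypothetical stand-in that records calls instead of writing files."""

    def __init__(self):
        self.calls = []

    def save_model(self, model, weights_path, peft_config=None, tokenizer=None,
                   is_final_checkpoint=False):
        self.calls.append(("save_model", weights_path))

    def save_optimizer(self, optimizer, model, weights_path, scheduler=None):
        self.calls.append(("save_optimizer", weights_path))


ckpt = RecordingCheckpointer()
save_training_state(ckpt, model=object(), optimizer=object(), weights_path="/tmp/step_100")
```

With a real `Checkpointer` in async mode, a caller would presumably also call `async_wait()` before relying on the files being complete; that ordering is an assumption, not something this page states.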

class nemo_automodel.components.checkpoint.checkpointing._AsyncSaveContext(
    stager: typing.Any | None,
    process_group: typing.Any | None,
    future: typing.Any | None,
    staging_active: bool = False
)

Dataclass

Internal container for async checkpointing state.

One instance is maintained for the model save and one for the optimizer save to keep staging/upload futures and the associated process group and stager together in a single place.

future

Any | None

process_group

Any | None

stager

Any | None

staging_active

bool = False
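
As a rough illustration of how such a container keeps per-save async state together, the sketch below mirrors the documented fields in a standalone dataclass. `AsyncSaveContextSketch` and `save_in_flight` are hypothetical names, and `concurrent.futures.Future` is only a stand-in for whatever future object an async save returns.

```python
from concurrent.futures import Future
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AsyncSaveContextSketch:
    """Illustrative mirror of the documented fields; not the library class."""

    stager: Optional[Any]
    process_group: Optional[Any]
    future: Optional[Any]
    staging_active: bool = False


def save_in_flight(ctx):
    """True while the context holds a future that has not completed."""
    return ctx.future is not None and not ctx.future.done()


pending = Future()  # stand-in for the future of an in-progress save
model_ctx = AsyncSaveContextSketch(stager=None, process_group=None, future=pending)
optim_ctx = AsyncSaveContextSketch(stager=None, process_group=None, future=None)
```

Keeping one context for the model and one for the optimizer lets each save be waited on independently.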

nemo_automodel.components.checkpoint.checkpointing._adapter_path(
    checkpoint_dir: str
) -> str

Return the PEFT adapter safetensors path inside a checkpoint dir (local or msc://).

nemo_automodel.components.checkpoint.checkpointing._apply(
    module,
    fn,
    recurse = True
) -> torch.nn.Module

Apply a transformation function to parameters (and gradients) only.

Mirrors nn.Module.to_empty for parameters while skipping buffers. Respects future flags controlling in-place vs swap behavior and safely handles wrapper subclasses.

Parameters:

module

Module whose parameters are to be transformed.

fn

Callable applied to each parameter (and its gradient).

recurse

Defaults to True

Whether to recurse into child modules.

Returns: nn.Module

The same module instance after transformation.

nemo_automodel.components.checkpoint.checkpointing._apply_key_mapping(
    state_dict: dict[str, torch.Tensor],
    key_mapping: dict[str, str]
) -> dict[str, torch.Tensor]

Rename state-dict keys using regex-based key_mapping.

This mirrors the renaming logic used by the DCP / HuggingFace storage reader but operates directly on an in-memory state dict. It is needed when loading safetensors checkpoints outside of DCP so that HF checkpoint keys (e.g. language_model.model.X) are translated to the model’s parameter FQNs (e.g. model.language_model.X).

Parameters:

state_dict

dict[str, torch.Tensor]

Original state dict whose keys may need renaming.

key_mapping

dict[str, str]

{regex_pattern: replacement} pairs applied in order.

Returns: dict[str, torch.Tensor]

A new dict with renamed keys.
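
A minimal sketch of regex-based key renaming, using the example prefixes mentioned above. `apply_key_mapping` is a hypothetical re-creation, not the library function; stopping at the first matching pattern is an assumption, and plain integers stand in for tensors.

```python
import re


def apply_key_mapping(state_dict, key_mapping):
    """Return a new dict whose keys are rewritten by the first matching pattern."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in key_mapping.items():
            new_key, count = re.subn(pattern, replacement, new_key)
            if count:
                break  # assumption: stop at the first pattern that matches
        renamed[new_key] = value
    return renamed


mapping = {r"^language_model\.model\.": "model.language_model."}
checkpoint = {"language_model.model.layers.0.weight": 0, "lm_head.weight": 1}
renamed = apply_key_mapping(checkpoint, mapping)
```

Keys that match no pattern, such as `lm_head.weight` here, pass through unchanged.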

nemo_automodel.components.checkpoint.checkpointing._convert_checkpoint_with_transformers(
    model: torch.nn.Module,
    model_path: str,
    key_mapping: typing.Optional[dict[str, str]] = None
) -> typing.Optional[dict[str, torch.Tensor]]

Convert a checkpoint using transformers’ conversion mapping for models that need tensor merging.

This handles MoE models like Mixtral where the checkpoint has individual expert weights but the model uses grouped expert tensors. The transformers library’s WeightConverter operations handle the tensor merging (MergeModulelist, Concatenate).

This function converts the state dict WITHOUT loading it into the model, so it can be used with FSDP-aware loading mechanisms.

Parameters:

model

nn.Module

The model (used to get conversion mapping and target keys).

model_path

str

Path to the HuggingFace checkpoint directory.

key_mapping

Optional[dict[str, str]], defaults to None

Optional additional key mapping.

Returns: Optional[dict[str, torch.Tensor]]

Converted state dict ready for loading, or None if conversion failed.

nemo_automodel.components.checkpoint.checkpointing._divide_keys_by_size(
    keys: list[str],
    state_dict: dict[str, torch.Tensor],
    target_shard_bytes: int
) -> dict[str, int]

Assign keys to deterministic size-based shards.
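
One way to get a deterministic size-based assignment is greedy packing over sorted keys. The sketch below is an assumption about the general technique, not the library's algorithm: `divide_keys_by_size` here is a hypothetical standalone function that takes a `{key: size_in_bytes}` dict instead of tensors, and it uses 1-based shard indices.

```python
def divide_keys_by_size(keys, sizes, target_shard_bytes):
    """Greedily pack keys, in sorted order, into shards of about target_shard_bytes."""
    mapping = {}
    shard, used = 1, 0
    for key in sorted(keys):  # sorting makes the assignment deterministic
        size = sizes[key]
        # Start a new shard when this key would exceed the target,
        # unless the current shard is still empty.
        if used > 0 and used + size > target_shard_bytes:
            shard += 1
            used = 0
        mapping[key] = shard
        used += size
    return mapping


sizes = {"a": 4, "b": 4, "c": 4, "d": 10}
shards = divide_keys_by_size(list(sizes), sizes, target_shard_bytes=8)
```

A key larger than the target (here `d`) still gets a shard of its own rather than being rejected.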

nemo_automodel.components.checkpoint.checkpointing._ensure_dirs(
    *dirs: typing.Optional[str]
) -> None

Create directories on all ranks and synchronize across ranks.

Parameters:

*dirs

Optional[str], defaults to ()

One or more directory paths that should exist.

nemo_automodel.components.checkpoint.checkpointing._ensure_msc_available() -> None

Raise an error if MSC is not installed but a cloud path is used.

nemo_automodel.components.checkpoint.checkpointing._equally_divide_layers(
    num_shards: int,
    keys: list[str]
) -> dict[str, int]

Equally divide the state dict keys into num_shards shards.
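
A minimal sketch of an even split, assuming contiguous chunks and 1-based shard indices; the library's actual grouping (for example, whether it keeps the keys of one layer together) is not described here. `equally_divide_keys` is a hypothetical name.

```python
def equally_divide_keys(num_shards, keys):
    """Assign keys to 1-based shard indices in contiguous, near-equal chunks."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(len(keys), num_shards)
    mapping = {}
    position = 0
    for shard in range(1, num_shards + 1):
        # The first `extra` shards take one additional key.
        count = base + (1 if shard <= extra else 0)
        for key in keys[position:position + count]:
            mapping[key] = shard
        position += count
    return mapping


split = equally_divide_keys(2, ["a", "b", "c", "d", "e"])
```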

nemo_automodel.components.checkpoint.checkpointing._get_checkpoint_metadata_keys(
    path: str,
    storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader] = None
) -> set[str]

Return checkpoint FQNs present in metadata.

nemo_automodel.components.checkpoint.checkpointing._get_hf_safetensors_reference_path(
    cache_dir: str | pathlib.Path | None,
    repo_id: str | None
) -> str | None

Return the local HF safetensors reference directory for a model.

Prefer the snapshot directory containing model.safetensors.index.json for sharded checkpoints. If no index exists but a snapshot directory is present, return that directory as the single-file safetensors reference path. Return None when repo_id is None or the repo has no cached snapshot directory.

For example, if the located file is

/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/13afe…/model.safetensors.index.json

this function will return the directory path

/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/13afe…

This will error if the model hasn’t been downloaded or if the cache directory is incorrect.

Parameters:

cache_dir

str | Path | None

Path to cache directory

repo_id

str | None

Hugging Face repository ID

Returns: str | None

Path to the snapshot/model directory containing safetensors weights, or None when repo_id is None or no cached snapshot directory exists.
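
The lookup described above can be sketched against the standard Hugging Face cache layout (`models--<org>--<name>/snapshots/<revision>/`). `find_snapshot_dir` is a hypothetical helper, not the library function; returning None when `cache_dir` is None and picking the first snapshot in sorted order are simplifying assumptions.

```python
from pathlib import Path


def find_snapshot_dir(cache_dir, repo_id):
    """Return a cached snapshot directory for repo_id, or None if there is none."""
    if repo_id is None or cache_dir is None:
        return None
    repo_dir = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    # Prefer a snapshot that holds a sharded-checkpoint index.
    for snap in candidates:
        if (snap / "model.safetensors.index.json").is_file():
            return str(snap)
    # Otherwise fall back to any snapshot directory (single-file checkpoint).
    return str(candidates[0]) if candidates else None
```

For example, `find_snapshot_dir("/opt/models", "meta-llama/Llama-3.2-3B")` would inspect `/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/`.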

nemo_automodel.components.checkpoint.checkpointing._get_original_hf_index_total_size(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> int | None

Return the original HF safetensors index total size, if available.

nemo_automodel.components.checkpoint.checkpointing._init_peft_adapters(
    model: torch.nn.Module,
    peft_init_method: str
) -> None

Initialize the PEFT adapters with the scaled weights.

Parameters:

model

nn.Module

Model to initialize PEFT adapters for

peft_init_method

str

Method to initialize PEFT adapters e.g. “xavier”. See LinearLoRA for more details.

nemo_automodel.components.checkpoint.checkpointing._is_bin_checkpoint(
    path: str
) -> bool

Return True if path looks like a PyTorch .bin checkpoint.

nemo_automodel.components.checkpoint.checkpointing._is_custom_model(
    module: torch.nn.Module
) -> bool

True if the model has a custom implementation in nemo_automodel/components/models/.

The generic HFCheckpointingMixin (in .common.hf_checkpointing_mixin) is injected into every model by _get_mixin_wrapped_class and does NOT count as a “custom model”. Only actual model implementations (e.g. llama, deepseek_v3) that live under nemo_automodel.components.models qualify.

nemo_automodel.components.checkpoint.checkpointing._is_remote_code_model(
    module: torch.nn.Module
) -> bool

True if the model was loaded with trust_remote_code (HF dynamic modules).

nemo_automodel.components.checkpoint.checkpointing._is_safetensors_checkpoint(
    path: str
) -> bool

Return True if path looks like a safetensors checkpoint (so we can preserve dtype); else DCP or other.

nemo_automodel.components.checkpoint.checkpointing._load_full_state_dict_into_model(
    model_parts: list[torch.nn.Module],
    state_dict: dict[str, torch.Tensor]
) -> None

Load a full (non-sharded) state dict into a potentially FSDP-wrapped model.

Every rank must supply the full state dict. PyTorch’s set_model_state_dict with full_state_dict=True (but not broadcast_from_rank0) calls _distribute_state_dict which lets each rank independently slice its local DTensor shard from the full tensor — no NCCL collectives are needed.

We intentionally avoid broadcast_from_rank0=True because it introduces an asymmetric workload: rank 0 does a synchronous CPU→GPU copy (.to(device)) per tensor while other ranks only do torch.empty (async allocation). The non-src ranks race ahead enqueuing hundreds of NCCL broadcasts that rank 0 cannot keep up with, leading to a 60 s NCCL watchdog timeout.

After loading, floating-point parameters are converted to match the checkpoint dtype. PyTorch’s set_model_state_dict uses copy semantics (assign=False) for non-meta parameters, which preserves the model’s initialisation dtype instead of the checkpoint dtype. The post-load fixup ensures the safetensors dtype (e.g. bf16) is honoured.

Parameters:

model_parts

list[nn.Module]

List of model parts (for pipeline parallelism)

state_dict

dict[str, torch.Tensor]

Full state dict with regular tensors. Must be populated on every rank (not just rank 0).

nemo_automodel.components.checkpoint.checkpointing._load_hf_bin_checkpoint(
    model_path: str,
    weights_only: bool = True
) -> typing.Optional[dict[str, torch.Tensor]]

Load a HuggingFace .bin checkpoint into a state dict.

Handles single-file (pytorch_model.bin), sharded (pytorch_model.bin.index.json), and glob fallback (*.bin) layouts. Returns None if no .bin files are found.

Parameters:

model_path

str

Path to checkpoint file or directory.

weights_only

bool, defaults to True

Passed to torch.load. Default True for safety; set to False for remote-code models whose checkpoints may contain custom pickled objects.
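
The three layouts can be told apart before any tensors are loaded. The sketch below covers only file discovery, in the order listed above; `find_bin_files` is a hypothetical helper, and the real function additionally calls torch.load on the files it finds. The `weight_map` key is the standard field of a Hugging Face `pytorch_model.bin.index.json`.

```python
import json
from pathlib import Path


def find_bin_files(model_path):
    """Return the .bin files to load, or an empty list if there are none."""
    path = Path(model_path)
    if path.is_file():
        return [path] if path.suffix == ".bin" else []
    single = path / "pytorch_model.bin"
    if single.is_file():
        return [single]  # single-file layout
    index = path / "pytorch_model.bin.index.json"
    if index.is_file():
        weight_map = json.loads(index.read_text())["weight_map"]
        # Several tensors share a shard file, so de-duplicate the file names.
        return [path / name for name in sorted(set(weight_map.values()))]
    return sorted(path.glob("*.bin"))  # glob fallback
```

An empty result corresponds to the documented case where the loader returns None.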

nemo_automodel.components.checkpoint.checkpointing._load_hf_checkpoint_preserving_dtype(
    model_path: str,
    weights_only: bool = True
) -> typing.Optional[dict[str, torch.Tensor]]

Load a HuggingFace checkpoint into a new state dict so tensor dtypes match the checkpoint (e.g. bf16). Used when loading the base model so FSDP sees uniform dtype instead of the model’s init dtypes (e.g. float32). Prefers safetensors but falls back to .bin files. Returns None if no loadable checkpoint is found.

Parameters:

model_path

str

Path to checkpoint file or directory.

weights_only

bool, defaults to True

Forwarded to torch.load when loading .bin files.

nemo_automodel.components.checkpoint.checkpointing._load_hf_safetensors_checkpoint(
    model_path: str
) -> typing.Optional[dict[str, torch.Tensor]]

Load a safetensors checkpoint into a state dict.

nemo_automodel.components.checkpoint.checkpointing._load_safetensors(
    path: str
) -> dict[str, torch.Tensor]

Read a safetensors file from a local path or an msc:// cloud path.

nemo_automodel.components.checkpoint.checkpointing._materialize_to_hf_views_for_save(
    state_dict: dict[str, torch.Tensor]
) -> None

Replace non-contiguous tensor values in state_dict with contiguous copies in place.

MoE adapters return non-contiguous strided views into the model’s grouped expert storage for the optimized load path; safetensors.torch.save (which the DCP HF storage writer calls) rejects non-contiguous tensors, so we materialize one tensor at a time here with empty_cache between iterations. Per-tensor transient is bounded to a single expert weight instead of allocating the full grouped set up front.

nemo_automodel.components.checkpoint.checkpointing._maybe_adapt_state_dict_from_hf(
    model_part: torch.nn.Module,
    state_dict: dict[str, torch.Tensor],
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
) -> dict[str, torch.Tensor]

Custom models use state dict adapters to convert the state dict from the Hugging Face format to the native format.

nemo_automodel.components.checkpoint.checkpointing._maybe_adapt_state_dict_to_hf(
    model_part: torch.nn.Module,
    state_dict: dict[str, torch.Tensor],
    quantization: bool = False,
    **kwargs
) -> dict[str, torch.Tensor]

Custom models use state dict adapters to convert the state dict to the Hugging Face format.

nemo_automodel.components.checkpoint.checkpointing._maybe_msc_reader(
    path: str,
    storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]

Return an MSC filesystem reader for msc:// paths, else the given reader.

nemo_automodel.components.checkpoint.checkpointing._maybe_msc_writer(
    path: str,
    storage_writer: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]

Return an MSC filesystem writer for msc:// paths, else the given writer.

nemo_automodel.components.checkpoint.checkpointing._model_has_dtensors(
    module: torch.nn.Module
) -> bool

True if any parameter is a DTensor (model is already sharded).

nemo_automodel.components.checkpoint.checkpointing._normalize_dtype_mapping_to_state_dict_keys(
    fqn_to_dtype_mapping: dict[str, str],
    state_dict_keys: list[str],
    base_model_prefix: str | None = None
) -> dict[str, str]

Align original HF dtype metadata with the keys that will be exported.

nemo_automodel.components.checkpoint.checkpointing._reinit_non_persistent_buffers(
    model: torch.nn.Module,
    device: torch.device,
    model_type: str | None = None
) -> None

Recompute non-persistent buffers that are not saved in checkpoints.

Non-persistent buffers are not saved in checkpoints, so after meta-device materialization they contain uninitialized CUDA memory. When initialize_weights() is skipped (e.g. for Gemma3 to avoid DTensor issues), these buffers must be recomputed explicitly.

Only runs for models listed in _MODELS_REQUIRING_BUFFER_REINIT to avoid unexpected side-effects on arbitrary HF Hub models.

Handles four patterns:

Standard RoPE — single inv_freq buffer with rope_init_fn + rope_kwargs (e.g. Nemotron-NAS).
Per-layer-type RoPE — {layer_type}_inv_freq buffers via compute_default_rope_parameters (e.g. Gemma3RotaryEmbedding).
Scaled embedding — embed_scale buffer on ScaledWordEmbedding modules (Gemma family), recomputed from scalar_embed_scale.
Vision position IDs — position_ids buffer on vision embedding modules (SigLIP), recomputed from num_positions.

Parameters:

model

nn.Module

Model to reinitialize non-persistent buffers for.

device

torch.device

Device to create the new buffers on.

model_type

str | None, defaults to None

The config.model_type string. If not in _MODELS_REQUIRING_BUFFER_REINIT the function is a no-op.

nemo_automodel.components.checkpoint.checkpointing._save_safetensors(
    state_dict: dict[str, torch.Tensor],
    path: str
) -> None

Write a safetensors file to a local path or an msc:// cloud path.

For cloud paths the tensors are serialized to bytes and streamed to the MSC file handle, since save_file only accepts a local filesystem path.

nemo_automodel.components.checkpoint.checkpointing._should_write_consolidated_safetensors(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
    is_final_checkpoint: bool = False
) -> bool

Whether to output consolidated HF weights along with sharded weights.

nemo_automodel.components.checkpoint.checkpointing._should_write_hf_metadata(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> bool

Whether to write HF metadata/artifacts for a checkpoint.

nemo_automodel.components.checkpoint.checkpointing._summarize_state_dict_key_diff(
    expected_keys: set[str],
    loaded_keys: set[str],
    limit: int = 10
) -> dict[str, typing.Any]

Summarize state-dict key mismatches for checkpoint load diagnostics.
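
A sketch of what such a summary might contain, using set differences and a capped sample of key names. `summarize_key_diff` and the result field names are hypothetical; the structure of the dict returned by the library function is not documented here.

```python
def summarize_key_diff(expected_keys, loaded_keys, limit=10):
    """Summarize which keys are missing from, or unexpected in, a checkpoint."""
    missing = sorted(expected_keys - loaded_keys)
    unexpected = sorted(loaded_keys - expected_keys)
    return {
        "num_missing": len(missing),
        "num_unexpected": len(unexpected),
        # Samples are capped at `limit` so log messages stay readable.
        "missing_sample": missing[:limit],
        "unexpected_sample": unexpected[:limit],
    }


summary = summarize_key_diff(
    expected_keys={"a.weight", "b.weight", "c.weight"},
    loaded_keys={"a.weight", "z.weight"},
    limit=1,
)
```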

nemo_automodel.components.checkpoint.checkpointing._warn_if_inline_consolidation_enabled(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> None

Educate users about the cost of inline HF consolidation.

nemo_automodel.components.checkpoint.checkpointing._warn_if_large_inline_consolidation(
    config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
    state_dict: dict[str, torch.Tensor],
    fqn_to_index_mapping: typing.Optional[dict[str, int]],
    is_final_checkpoint: bool = False
) -> None

Warn when inline consolidated export is large enough to waste GPU allocation time.

nemo_automodel.components.checkpoint.checkpointing.is_cloud_path(
    path: str
) -> bool

Check if path is a cloud storage path (MSC).
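
Since cloud paths on this page are written with the `msc://` scheme, a prefix test is the simplest form such a check could take. The sketch below assumes exactly that; `looks_like_cloud_path` and `require_msc` are hypothetical names, and the exception type raised by the real `_ensure_msc_available` is not stated here.

```python
def looks_like_cloud_path(path):
    """True for paths that use the msc:// scheme."""
    return str(path).startswith("msc://")


def require_msc(path, msc_available):
    """Fail early when a cloud path is used without MSC support installed."""
    if looks_like_cloud_path(path) and not msc_available:
        # Assumption: the exception type is illustrative only.
        raise RuntimeError(f"{path} is a cloud path but MSC is not installed")
```

Local paths pass through `require_msc` untouched regardless of whether MSC is installed.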

nemo_automodel.components.checkpoint.checkpointing.save_config(
    config: dict[str, typing.Any],
    weights_path: str
) -> None

Save a config to a weights path.

Parameters:

config

dict[str, Any]

Config to save

weights_path

str

Path to save config

nemo_automodel.components.checkpoint.checkpointing.to_empty_parameters_only(
    model: torch.nn.Module,
    device: torch.device,
    recurse: bool = True,
    dtype: torch.dtype | None = None
) -> torch.nn.Module

Move parameters to the specified device without copying storage, skipping buffers.

Mirrors torch.nn.Module.to_empty but applies only to parameters, not buffers.

Parameters:

model

nn.Module

The module to transform

device

torch.device

Target device

recurse

bool, defaults to True

Whether to recurse into child modules

Returns: nn.Module

The same module instance

nemo_automodel.components.checkpoint.checkpointing.MSC_AVAILABLE = True

nemo_automodel.components.checkpoint.checkpointing._CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES = 50 * 1024 ** 3

nemo_automodel.components.checkpoint.checkpointing._DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES = 5 * 1024 ** 3
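
For orientation, the two size constants above work out to 50 GiB and 5 GiB. The short calculation below shows the byte values and a hypothetical shard-count estimate; `estimated_shard_count` is an illustrative helper, and how the library actually applies these constants is not described on this page.

```python
GIB = 1024 ** 3

warning_threshold_bytes = 50 * GIB  # 53_687_091_200 bytes
default_shard_bytes = 5 * GIB       # 5_368_709_120 bytes


def estimated_shard_count(total_bytes, shard_bytes=default_shard_bytes):
    """Ceiling division: number of shards needed to hold total_bytes."""
    return max(1, -(-total_bytes // shard_bytes))


shards_for_16_gib = estimated_shard_count(16 * GIB)
```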

nemo_automodel.components.checkpoint.checkpointing._MODELS_REQUIRING_BUFFER_REINIT: frozenset[str] = frozenset({'gemma3', 'nemotron-nas'})

nemo_automodel.components.checkpoint.checkpointing.logger = logging.getLogger(__name__)