nemo_automodel.components.checkpoint.checkpointing


Module Contents

Classes

Name | Description
Checkpointer | High-level checkpoint manager built on torch.distributed.checkpoint (DCP).
_AsyncSaveContext | Internal container for async checkpointing state.

Functions

Name | Description
_adapter_path | Return the PEFT adapter safetensors path inside a checkpoint dir (local or msc://).
_apply | Apply a transformation function to parameters (and gradients) only.
_apply_key_mapping | Rename state-dict keys using regex-based key_mapping.
_convert_checkpoint_with_transformers | Convert a checkpoint using transformers’ conversion mapping for models that need tensor merging.
_divide_keys_by_size | Assign keys to deterministic size-based shards.
_ensure_dirs | Create directories on all ranks and synchronize across ranks.
_ensure_msc_available | Raise an error if MSC is not installed but a cloud path is used.
_equally_divide_layers | Equally divide the state dict keys into num_shards shards.
_get_checkpoint_metadata_keys | Return checkpoint FQNs present in metadata.
_get_hf_safetensors_reference_path | Return the local HF safetensors reference directory for a model.
_get_original_hf_index_total_size | Return the original HF safetensors index total size, if available.
_init_peft_adapters | Initialize the PEFT adapters with the scaled weights.
_is_bin_checkpoint | Return True if path looks like a PyTorch .bin checkpoint.
_is_custom_model | True if the model has a custom implementation in nemo_automodel/components/models/.
_is_remote_code_model | True if the model was loaded with trust_remote_code (HF dynamic modules).
_is_safetensors_checkpoint | Return True if path looks like a safetensors checkpoint (so we can preserve dtype); else DCP or other.
_load_full_state_dict_into_model | Load a full (non-sharded) state dict into a potentially FSDP-wrapped model.
_load_hf_bin_checkpoint | Load a HuggingFace .bin checkpoint into a state dict.
_load_hf_checkpoint_preserving_dtype | Load a HuggingFace checkpoint into a new state dict so tensor dtypes match the checkpoint (e.g. bf16).
_load_hf_safetensors_checkpoint | Load a safetensors checkpoint into a state dict.
_load_safetensors | Read a safetensors file from a local path or an msc:// cloud path.
_materialize_to_hf_views_for_save | Replace non-contiguous tensor values in state_dict with contiguous copies in place.
_maybe_adapt_state_dict_from_hf | Custom models use state dict adapters to convert the state dict from the Hugging Face format to the native format.
_maybe_adapt_state_dict_to_hf | Custom models use state dict adapters to convert the state dict to the Hugging Face format.
_maybe_msc_reader | Return an MSC filesystem reader for msc:// paths, else the given reader.
_maybe_msc_writer | Return an MSC filesystem writer for msc:// paths, else the given writer.
_model_has_dtensors | True if any parameter is a DTensor (model is already sharded).
_normalize_dtype_mapping_to_state_dict_keys | Align original HF dtype metadata with the keys that will be exported.
_reinit_non_persistent_buffers | Recompute non-persistent buffers that are not saved in checkpoints.
_save_safetensors | Write a safetensors file to a local path or an msc:// cloud path.
_should_write_consolidated_safetensors | Whether to output consolidated HF weights along with sharded weights.
_should_write_hf_metadata | Whether to write HF metadata/artifacts for a checkpoint.
_summarize_state_dict_key_diff | Summarize state-dict key mismatches for checkpoint load diagnostics.
_warn_if_inline_consolidation_enabled | Educate users about the cost of inline HF consolidation.
_warn_if_large_inline_consolidation | Warn when inline consolidated export is large enough to waste GPU allocation time.
is_cloud_path | Check if path is a cloud storage path (MSC).
save_config | Save a config to a weights path.
to_empty_parameters_only | Move parameters to the specified device without copying storage, skipping buffers.

Data

MSC_AVAILABLE

_CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES

_DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES

_MODELS_REQUIRING_BUFFER_REINIT

logger

API

class nemo_automodel.components.checkpoint.checkpointing.Checkpointer(
config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
dp_rank: int,
tp_rank: int,
pp_rank: int,
moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
)

High-level checkpoint manager built on torch.distributed.checkpoint (DCP).

Supports:

  • HF sharded safetensors via custom storage reader/writer
  • Optional consolidated export (config, generation config, tokenizer)
  • PEFT adapter save/load handling
  • Async save for torch >= 2.9.0

Also provides DP-aware helpers for saving/loading auxiliary state and utilities to initialize from a base HF checkpoint.

_addons = []
_model_ctx
_optim_ctx
nemo_automodel.components.checkpoint.checkpointing.Checkpointer._do_load(
state_dict: dict[str, torch.Tensor],
path: str,
storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader] = None,
is_init_step: bool = False
) -> dict[str, torch.Tensor]

Load a state dictionary from path using DCP or PEFT special-case logic.

Parameters:

state_dict
dict[str, torch.Tensor]

Mutable state dict to populate with tensors.

path
str

Checkpoint directory path.

storage_reader
Optional[_HuggingFaceStorageReader]. Defaults to None

Optional HF storage reader for safetensors.

is_init_step
bool. Defaults to False

True if loading from a base checkpoint during initialization.

Returns: dict[str, torch.Tensor]

The populated state dictionary (may be replaced for PEFT).

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._do_save(
state_dict: dict[str, torch.Tensor],
path: str,
storage_writer: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter] = None
) -> typing.Optional[torch.distributed.checkpoint.state_dict_saver.AsyncSaveResponse]

Save a state dictionary to path using DCP or PEFT special-case logic.

  • For PEFT model saves: only rank 0 writes adapter_model.safetensors.
  • If async mode is enabled, schedule an asynchronous save.

Parameters:

state_dict
dict[str, torch.Tensor]

State dict to be serialized.

path
str

Checkpoint directory path.

storage_writer
Optional[_HuggingFaceStorageWriter]. Defaults to None

Optional HF storage writer for safetensors sharding.

Returns: Optional[AsyncSaveResponse]

An AsyncSaveResponse for the scheduled save if async mode is enabled; otherwise None.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_original_model_path(
model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState
) -> str | None

Get the path to the original model from the Hugging Face checkpoint.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_storage_reader(
model_path: str,
key_mapping: typing.Optional[dict[str, str]],
is_init_step: bool = False,
is_safetensors: bool | None = None
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]

Construct a Hugging Face storage reader when loading safetensors or during init.

Prefers the upstream torch.distributed.checkpoint.hf_storage.HuggingFaceStorageReader when no key_mapping is needed, since it uses safetensors’ native get_slice() for efficient partial reads (only the bytes for the local DTensor shard are read from disk). Falls back to the backported reader when key_mapping is required or when the upstream reader is not available.

Parameters:

model_path
str

Path to the model checkpoint directory or HF snapshot.

key_mapping
Optional[dict[str, str]]

Optional key remapping for conversion.

is_init_step
bool. Defaults to False

If True, always produce a reader for base HF load.

is_safetensors
bool | None. Defaults to None

Whether model_path holds a safetensors checkpoint; computed from the directory contents when not supplied.

Returns: Optional[_HuggingFaceStorageReader]

Configured storage reader, or None for the default DCP FileSystemReader.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._get_storage_writer(
consolidated_output_path: typing.Optional[str],
fqn_to_index_mapping: typing.Optional[dict[str, int]],
fqn_to_dtype_mapping: typing.Optional[dict[str, str]],
model_path: str,
consolidate_on_all_ranks: bool = False
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]

Construct a Hugging Face storage writer for sharded safetensors.

Parameters:

consolidated_output_path
Optional[str]

Optional path for consolidated artifacts.

fqn_to_index_mapping
Optional[dict[str, int]]

Optional mapping from FQN to shard index.

fqn_to_dtype_mapping
Optional[dict[str, str]]

Optional mapping from FQN to original HF safetensors dtype string.

model_path
str

Path where the model checkpoint is saved.

consolidate_on_all_ranks
bool. Defaults to False

If True, perform consolidation on all ranks instead of only on the main process.

Returns: Optional[_HuggingFaceStorageWriter]

Configured _HuggingFaceStorageWriter or None for non-safetensors.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_build_consolidated_index(
model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState,
state_dict: dict[str, torch.Tensor]
) -> typing.Optional[dict[str, int]]

Build FQN to shard index mapping for consolidated HF export.

Uses the base checkpoint index (if present), removes non-persistent keys, and assigns new keys to the last shard by default.

Parameters:

model_state
ModelState

Wrapper exposing the primary model part.

state_dict
dict[str, torch.Tensor]

The state dict that will be saved.

Returns: Optional[dict[str, int]]

Mapping from FQN to shard index, or None when not consolidating.
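The index-building rule described above can be illustrated with plain dictionaries. This is a minimal sketch and not the library's implementation: the helper name build_consolidated_index_sketch is hypothetical, "non-persistent keys" are approximated as base-index keys that are absent from the state dict being saved, and "last shard" is taken to mean the highest shard number in the base index.

```python
def build_consolidated_index_sketch(base_index, state_dict_keys):
    """Map each key that will be saved to a shard number (illustrative only)."""
    wanted = set(state_dict_keys)
    # Keep base-index entries only for keys that are actually being saved.
    index = {fqn: shard for fqn, shard in base_index.items() if fqn in wanted}
    # Assumption: new keys go to the highest shard number seen in the base index.
    last_shard = max(base_index.values(), default=1)
    for fqn in state_dict_keys:
        index.setdefault(fqn, last_shard)
    return index


base_index = {
    "embed.weight": 1,
    "layers.0.weight": 1,
    "rotary.inv_freq": 2,   # not in the saved state dict, so it is dropped
    "lm_head.weight": 2,
}
keys_to_save = ["embed.weight", "layers.0.weight", "lm_head.weight", "adapter.extra"]
index = build_consolidated_index_sketch(base_index, keys_to_save)
```

Here adapter.extra has no entry in the base index, so it lands in shard 2 alongside lm_head.weight.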

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_build_original_dtype_mapping(
model_state: nemo_automodel.components.checkpoint.stateful_wrappers.ModelState,
state_dict: dict[str, torch.Tensor]
) -> typing.Optional[dict[str, str]]

Build FQN to original HF safetensors dtype mapping for consolidated export.

Returns None when the run started from config-only weights or the original HF safetensors headers are not available. In that case consolidation keeps the saved checkpoint dtype unless the user explicitly passes CAST_DTYPE to the offline helper.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_log_final_offline_consolidation_hint(
model_dir: str,
is_final_checkpoint: bool = False
) -> None

Log the final-checkpoint helper hint when consolidated export was disabled.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer._maybe_write_offline_consolidation_script(
model_dir: str
) -> None

Write a conservative helper script for offline HF safetensors consolidation.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.async_wait() -> None

Wait for the async save to finish.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.close() -> None

Close the checkpointer.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.initialize_model_weights(
model: torch.nn.Module,
device: torch.device,
peft_init_method: str | None = None
) -> None
staticmethod

Materialize meta-device parameters and initialize model weights.

Moves empty parameter shells to the target device, resets HF initialization flags, calls the model’s weight initialization method, and initializes any PEFT adapters.

Parameters:

model
torch.nn.Module

Model whose weights should be initialized.

device
torch.device

Target device for materialized parameters.

peft_init_method
str | None. Defaults to None

Initialization method for PEFT adapters (e.g. “xavier”).

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_base_model(
model: torch.nn.Module,
device: torch.device,
root_dir: str,
model_name: str | None,
load_base_model: bool = True
) -> None

Load a model from the base Hugging Face checkpoint in parallel.

Parameters:

model
torch.nn.Module

Model to load state into

device
torch.device

Device to load model onto

root_dir
str

Root directory of the model cache or snapshots

model_name
str | None

Name of the model or an absolute path to a snapshot

load_base_model
bool. Defaults to True

If True, restore from HF base checkpoint

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_distributed_state(
state: typing.Any,
state_name: str,
path: str
) -> None

Load a custom stateful object previously saved with DCP.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_model(
model: torch.nn.Module,
model_path: str,
is_init_step: bool = False,
use_checkpoint_id: bool = True,
key_mapping: typing.Optional[dict[str, str]] = None,
allow_checkpoint_key_subset: bool = False
) -> None

Load model weights from model_path.

Behavior:

  • For PEFT (non-init): rank 0 reads adapter_model.safetensors, then broadcasts.
  • Otherwise: use DCP with a Hugging Face or default storage reader to populate the state dict.
  • If the model exposes a state_dict_adapter, convert to/from HF format as needed.
  • For models requiring tensor merging (e.g., Mixtral), uses transformers’ conversion mapping.

Parameters:

model
nn.Module

Model or parallelized model parts to load into.

model_path
str

Path to the model checkpoint directory or HF snapshot.

is_init_step
bool. Defaults to False

If True, treat load as initialization from a base checkpoint.

use_checkpoint_id
bool. Defaults to True

Pass checkpoint_id to DCP if True; disable when using direct HF paths.

key_mapping
Optional[dict[str, str]]. Defaults to None

Optional key remapping when reading from HF checkpoints.

allow_checkpoint_key_subset
bool. Defaults to False

If True, keep the model’s current initialization for parameters that are absent from the checkpoint instead of requiring an exact key match.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_on_dp_ranks(
state: typing.Any,
state_name: str,
path: str
) -> None

Load the stateful object.

Helper currently used to restore the dataloader and RNG state.

Parameters:

state
Any

Stateful object to load

state_name
str

Name of the stateful object

path
str

Path to load stateful object

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.load_optimizer(
optimizer: torch.optim.Optimizer,
model: torch.nn.Module,
weights_path: str,
scheduler: typing.Optional[typing.Any] = None
) -> None

Load optimizer (and optional scheduler) state from weights_path/optim using DCP.

Parameters:

optimizer
torch.optim.Optimizer

Optimizer to populate.

model
nn.Module

Model providing partitioning context for the optimizer wrapper.

weights_path
str

Base directory for checkpoints.

scheduler
Optional[Any]. Defaults to None

Optional LR scheduler to populate.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.maybe_wait_for_staging() -> None

Wait for the staging to finish if it is enabled.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_distributed_state(
state: typing.Any,
state_name: str,
path: str
) -> None

Save a custom stateful object through DCP on all ranks.

This is intended for auxiliary objects whose state dict contains sharded tensors, for example BAGEL EMA shadows under FSDP2. Rank-0 torch.save would only persist rank 0’s local shard; DCP sees the DTensor metadata and writes all shards correctly.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_model(
model: torch.nn.Module,
weights_path: str,
peft_config: typing.Optional[peft.PeftConfig] = None,
tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
is_final_checkpoint: bool = False
) -> None

Save model weights to weights_path/model.

Behavior:

  • PEFT: write adapter_model.safetensors and metadata on rank 0.
  • Safetensors + consolidation: emit HF artifacts under weights_path/model/consolidated and build a consolidated index.
  • Otherwise: use DCP with a Hugging Face or default storage writer to save shards.

Parameters:

model
nn.Module

Model to checkpoint.

weights_path
str

Base directory for checkpoints.

peft_config
Optional[PeftConfig]. Defaults to None

Optional PEFT configuration when saving adapters.

tokenizer
Optional[PreTrainedTokenizerBase]. Defaults to None

Optional tokenizer to save with consolidated artifacts.

is_final_checkpoint
bool. Defaults to False

Whether this save is the final scheduled training checkpoint.

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_on_dp_ranks(
state: typing.Any,
state_name: str,
path: str
) -> None

Save the stateful object.

Helper currently used to save the dataloader and RNG state.

Parameters:

state
Any

Stateful object to save

state_name
str

Name of the stateful object

path
str

Path to save stateful object

nemo_automodel.components.checkpoint.checkpointing.Checkpointer.save_optimizer(
optimizer: torch.optim.Optimizer,
model: torch.nn.Module,
weights_path: str,
scheduler: typing.Optional[typing.Any] = None
) -> None

Save optimizer (and optional scheduler) state to weights_path/optim using DCP.

Parameters:

optimizer
torch.optim.Optimizer

Optimizer whose state will be saved.

model
nn.Module

Model providing partitioning context for the optimizer wrapper.

weights_path
str

Base directory for checkpoints.

scheduler
Optional[Any]. Defaults to None

Optional LR scheduler to include.

class nemo_automodel.components.checkpoint.checkpointing._AsyncSaveContext(
stager: typing.Any | None,
process_group: typing.Any | None,
future: typing.Any | None,
staging_active: bool = False
)
Dataclass

Internal container for async checkpointing state.

One instance is maintained for the model save and one for the optimizer save to keep staging/upload futures and the associated process group and stager together in a single place.

future
Any | None
process_group
Any | None
stager
Any | None
staging_active
bool = False
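As a rough illustration of how such a container might be used, the sketch below keeps one context for the model save and one for the optimizer save and waits on whichever future is pending. The class AsyncSaveContextSketch and the helper wait_for_pending_save are hypothetical stand-ins; the None defaults on the first three fields and the use of concurrent.futures.Future are assumptions made so the example runs without a distributed setup.

```python
from concurrent.futures import Future
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AsyncSaveContextSketch:
    """Stand-in that mirrors the documented fields (defaults are assumed)."""
    stager: Optional[Any] = None
    process_group: Optional[Any] = None
    future: Optional[Any] = None
    staging_active: bool = False


def wait_for_pending_save(ctx):
    """Block until the pending save (if any) completes, then clear it."""
    if ctx.future is not None:
        ctx.future.result()
        ctx.future = None


# One context per kind of save, as described above.
model_ctx = AsyncSaveContextSketch()
optim_ctx = AsyncSaveContextSketch()

# Simulate an already-finished model save; the optimizer has nothing pending.
pending = Future()
pending.set_result(None)
model_ctx.future = pending

wait_for_pending_save(model_ctx)
wait_for_pending_save(optim_ctx)
```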
nemo_automodel.components.checkpoint.checkpointing._adapter_path(
checkpoint_dir: str
) -> str

Return the PEFT adapter safetensors path inside a checkpoint dir (local or msc://).
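A minimal sketch of the local-versus-cloud distinction, assuming that cloud paths are recognised purely by the msc:// prefix and that the adapter file is named adapter_model.safetensors as stated for PEFT saves. The names is_cloud_path_sketch and adapter_path_sketch are hypothetical and do not reproduce the module's actual code.

```python
import os

MSC_PREFIX = "msc://"
ADAPTER_FILE = "adapter_model.safetensors"


def is_cloud_path_sketch(path):
    """Treat any path with the msc:// scheme as a cloud path (assumption)."""
    return path.startswith(MSC_PREFIX)


def adapter_path_sketch(checkpoint_dir):
    """Join the adapter file name onto a local or msc:// checkpoint dir."""
    if is_cloud_path_sketch(checkpoint_dir):
        # URLs always use forward slashes, independent of the host OS.
        return checkpoint_dir.rstrip("/") + "/" + ADAPTER_FILE
    return os.path.join(checkpoint_dir, ADAPTER_FILE)


cloud = adapter_path_sketch("msc://bucket/run1/step_100/")
local = adapter_path_sketch(os.path.join("checkpoints", "step_100"))
```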

nemo_automodel.components.checkpoint.checkpointing._apply(
module,
fn,
recurse = True
) -> torch.nn.Module

Apply a transformation function to parameters (and gradients) only.

Mirrors nn.Module.to_empty for parameters while skipping buffers. Respects future flags controlling in-place vs swap behavior and safely handles wrapper subclasses.

Parameters:

module

Module whose parameters are to be transformed.

fn

Callable applied to each parameter (and its gradient).

recurse
Defaults to True

Whether to recurse into child modules.

Returns: nn.Module

The same module instance after transformation.

nemo_automodel.components.checkpoint.checkpointing._apply_key_mapping(
state_dict: dict[str, torch.Tensor],
key_mapping: dict[str, str]
) -> dict[str, torch.Tensor]

Rename state-dict keys using regex-based key_mapping.

This mirrors the renaming logic used by the DCP / HuggingFace storage reader but operates directly on an in-memory state dict. It is needed when loading safetensors checkpoints outside of DCP so that HF checkpoint keys (e.g. language_model.model.X) are translated to the model’s parameter FQNs (e.g. model.language_model.X).

Parameters:

state_dict
dict[str, torch.Tensor]

Original state dict whose keys may need renaming.

key_mapping
dict[str, str]

{regex_pattern: replacement} pairs applied in order.

Returns: dict[str, torch.Tensor]

A new dict with renamed keys.
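The renaming can be pictured with the standard re module. This is a sketch under stated assumptions rather than the function's real body: apply_key_mapping_sketch is a hypothetical name, every rule is applied to each key in dictionary order, and plain integers stand in for tensors.

```python
import re


def apply_key_mapping_sketch(state_dict, key_mapping):
    """Return a new dict whose keys were rewritten by each regex rule in order."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in key_mapping.items():
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = value
    return renamed


# Integers stand in for tensors so the example has no torch dependency.
example = {"language_model.model.layers.0.weight": 1, "lm_head.weight": 2}
mapping = {r"^language_model\.model\.": "model.language_model."}
renamed = apply_key_mapping_sketch(example, mapping)
```

Keys that match no pattern, such as lm_head.weight here, pass through unchanged, and the input dict is left untouched.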

nemo_automodel.components.checkpoint.checkpointing._convert_checkpoint_with_transformers(
model: torch.nn.Module,
model_path: str,
key_mapping: typing.Optional[dict[str, str]] = None
) -> typing.Optional[dict[str, torch.Tensor]]

Convert a checkpoint using transformers’ conversion mapping for models that need tensor merging.

This handles MoE models like Mixtral where the checkpoint has individual expert weights but the model uses grouped expert tensors. The transformers library’s WeightConverter operations handle the tensor merging (MergeModulelist, Concatenate).

This function converts the state dict WITHOUT loading it into the model, so it can be used with FSDP-aware loading mechanisms.

Parameters:

model
nn.Module

The model (used to get conversion mapping and target keys).

model_path
str

Path to the HuggingFace checkpoint directory.

key_mapping
Optional[dict[str, str]]. Defaults to None

Optional additional key mapping.

Returns: Optional[dict[str, torch.Tensor]]

Converted state dict ready for loading, or None if conversion failed.

nemo_automodel.components.checkpoint.checkpointing._divide_keys_by_size(
keys: list[str],
state_dict: dict[str, torch.Tensor],
target_shard_bytes: int
) -> dict[str, int]

Assign keys to deterministic size-based shards.
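One way to obtain a deterministic size-based assignment is a greedy pass over the sorted keys. The sketch below is an assumption about the general technique, not the function's actual algorithm: divide_keys_by_size_sketch is a hypothetical name, sizes are passed as plain byte counts instead of being read from tensors, and shard numbers start at 1.

```python
def divide_keys_by_size_sketch(keys, sizes_in_bytes, target_shard_bytes):
    """Greedily pack keys into shards of at most target_shard_bytes each."""
    mapping = {}
    shard = 1
    used = 0
    # Sorting makes the result independent of the caller's key order.
    for key in sorted(keys):
        size = sizes_in_bytes[key]
        # Open a new shard once the current one would overflow.
        if used > 0 and used + size > target_shard_bytes:
            shard += 1
            used = 0
        mapping[key] = shard
        used += size
    return mapping


sizes = {"a": 4, "b": 4, "c": 6, "d": 2}
assignment = divide_keys_by_size_sketch(["d", "c", "b", "a"], sizes, target_shard_bytes=8)
```

A single tensor larger than the target still gets a shard of its own in this sketch, since a shard is only closed when it already holds something.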

nemo_automodel.components.checkpoint.checkpointing._ensure_dirs(
*dirs: typing.Optional[str]
) -> None

Create directories on all ranks and synchronize across ranks.

Parameters:

*dirs
Optional[str]. Defaults to ()

One or more directory paths that should exist.

nemo_automodel.components.checkpoint.checkpointing._ensure_msc_available() -> None

Raise an error if MSC is not installed but a cloud path is used.

nemo_automodel.components.checkpoint.checkpointing._equally_divide_layers(
num_shards: int,
keys: list[str]
) -> dict[str, int]

Equally divide the state dict keys into num_shards shards.
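An equal split can be sketched as contiguous chunks whose sizes differ by at most one. The helper name equally_divide_keys_sketch, the 1-based shard numbering, and the choice to give the earliest shards the extra keys are all assumptions for illustration.

```python
def equally_divide_keys_sketch(num_shards, keys):
    """Assign keys to num_shards contiguous, near-equal chunks (1-based)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(len(keys), num_shards)
    mapping = {}
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards take one additional key each.
        size = base + (1 if shard < extra else 0)
        for key in keys[start:start + size]:
            mapping[key] = shard + 1
        start += size
    return mapping


layout = equally_divide_keys_sketch(2, ["a", "b", "c", "d", "e"])
```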

nemo_automodel.components.checkpoint.checkpointing._get_checkpoint_metadata_keys(
path: str,
storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader] = None
) -> set[str]

Return checkpoint FQNs present in metadata.

nemo_automodel.components.checkpoint.checkpointing._get_hf_safetensors_reference_path(
cache_dir: str | pathlib.Path | None,
repo_id: str | None
) -> str | None

Return the local HF safetensors reference directory for a model.

Prefer the snapshot directory containing model.safetensors.index.json for sharded checkpoints. If no index exists but a snapshot directory is present, return that directory as the single-file safetensors reference path. Return None when repo_id is None or the repo has no cached snapshot directory.

For example, if the located file is

/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/13afe…/model.safetensors.index.json

this function will return the directory path

/opt/models/models--meta-llama--Llama-3.2-3B/snapshots/13afe…

This will error if the model hasn’t been downloaded or if the cache directory is incorrect.

Parameters:

cache_dir
str | Path | None

Path to cache directory

repo_id
str | None

Hugging Face repository ID

Returns: str | None

Path to the snapshot/model directory containing safetensors weights, or None if repo_id is None or no cached snapshot directory exists.
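The lookup can be sketched with pathlib against the standard Hugging Face cache layout (models--{org}--{name}/snapshots/{revision}). This is an illustrative approximation only: reference_path_sketch is a hypothetical name, returning None for a missing cache_dir is an assumption, and picking the first snapshot in sorted order when several exist is an arbitrary choice.

```python
import tempfile
from pathlib import Path


def reference_path_sketch(cache_dir, repo_id):
    """Find a cached snapshot directory, preferring one with a sharded index."""
    if cache_dir is None or repo_id is None:
        return None
    repo_dir = "models--" + repo_id.replace("/", "--")
    snapshots = Path(cache_dir) / repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    for snap in candidates:
        if (snap / "model.safetensors.index.json").is_file():
            return str(snap)
    # No index found: fall back to a snapshot dir (single-file safetensors case).
    return str(candidates[0]) if candidates else None


# Build a throwaway cache with one snapshot to exercise the lookup.
with tempfile.TemporaryDirectory() as cache:
    snapshot = Path(cache) / "models--example-org--tiny-model" / "snapshots" / "abc123"
    snapshot.mkdir(parents=True)
    (snapshot / "model.safetensors.index.json").write_text("{}")
    found = reference_path_sketch(cache, "example-org/tiny-model")
    missing = reference_path_sketch(cache, "example-org/not-downloaded")
```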

nemo_automodel.components.checkpoint.checkpointing._get_original_hf_index_total_size(
config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> int | None

Return the original HF safetensors index total size, if available.

nemo_automodel.components.checkpoint.checkpointing._init_peft_adapters(
model: torch.nn.Module,
peft_init_method: str
) -> None

Initialize the PEFT adapters with the scaled weights.

Parameters:

model
nn.Module

Model to initialize PEFT adapters for

peft_init_method
str

Method to initialize PEFT adapters e.g. “xavier”. See LinearLoRA for more details.

nemo_automodel.components.checkpoint.checkpointing._is_bin_checkpoint(
path: str
) -> bool

Return True if path looks like a PyTorch .bin checkpoint.

nemo_automodel.components.checkpoint.checkpointing._is_custom_model(
module: torch.nn.Module
) -> bool

True if the model has a custom implementation in nemo_automodel/components/models/.

The generic HFCheckpointingMixin (in .common.hf_checkpointing_mixin) is injected into every model by _get_mixin_wrapped_class and does NOT count as a “custom model”. Only actual model implementations (e.g. llama, deepseek_v3) that live under nemo_automodel.components.models qualify.

nemo_automodel.components.checkpoint.checkpointing._is_remote_code_model(
module: torch.nn.Module
) -> bool

True if the model was loaded with trust_remote_code (HF dynamic modules).

nemo_automodel.components.checkpoint.checkpointing._is_safetensors_checkpoint(
path: str
) -> bool

Return True if path looks like a safetensors checkpoint (so we can preserve dtype); else DCP or other.

nemo_automodel.components.checkpoint.checkpointing._load_full_state_dict_into_model(
model_parts: list[torch.nn.Module],
state_dict: dict[str, torch.Tensor]
) -> None

Load a full (non-sharded) state dict into a potentially FSDP-wrapped model.

Every rank must supply the full state dict. PyTorch’s set_model_state_dict with full_state_dict=True (but not broadcast_from_rank0) calls _distribute_state_dict which lets each rank independently slice its local DTensor shard from the full tensor — no NCCL collectives are needed.

We intentionally avoid broadcast_from_rank0=True because it introduces an asymmetric workload: rank 0 does a synchronous CPU→GPU copy (.to(device)) per tensor while other ranks only do torch.empty (async allocation). The non-src ranks race ahead enqueuing hundreds of NCCL broadcasts that rank 0 cannot keep up with, leading to a 60 s NCCL watchdog timeout.

After loading, floating-point parameters are converted to match the checkpoint dtype. PyTorch’s set_model_state_dict uses copy semantics (assign=False) for non-meta parameters, which preserves the model’s initialisation dtype instead of the checkpoint dtype. The post-load fixup ensures the safetensors dtype (e.g. bf16) is honoured.

Parameters:

model_parts
list[nn.Module]

List of model parts (for pipeline parallelism)

state_dict
dict[str, torch.Tensor]

Full state dict with regular tensors. Must be populated on every rank (not just rank 0).

nemo_automodel.components.checkpoint.checkpointing._load_hf_bin_checkpoint(
model_path: str,
weights_only: bool = True
) -> typing.Optional[dict[str, torch.Tensor]]

Load a HuggingFace .bin checkpoint into a state dict.

Handles single-file (pytorch_model.bin), sharded (pytorch_model.bin.index.json), and glob fallback (*.bin) layouts. Returns None if no .bin files are found.

Parameters:

model_path
str

Path to checkpoint file or directory.

weights_only
bool. Defaults to True

Passed to torch.load. Default True for safety; set to False for remote-code models whose checkpoints may contain custom pickled objects.
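The three layouts can be told apart with a small file-discovery routine. The sketch below only locates the files and does not call torch.load; find_bin_files_sketch is a hypothetical name, the precedence order (single file, then index, then glob) is an assumption, and it returns an empty list where the documented function returns None.

```python
import json
import tempfile
from pathlib import Path


def find_bin_files_sketch(model_path):
    """Return the .bin files to load for a file or directory path."""
    root = Path(model_path)
    if root.is_file():
        return [root] if root.suffix == ".bin" else []
    single = root / "pytorch_model.bin"
    if single.is_file():
        return [single]
    index = root / "pytorch_model.bin.index.json"
    if index.is_file():
        # HF sharded indexes map each parameter name to its shard file.
        weight_map = json.loads(index.read_text())["weight_map"]
        return [root / name for name in sorted(set(weight_map.values()))]
    # Glob fallback for directories with non-standard .bin names.
    return sorted(root.glob("*.bin"))


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "shard-a.bin").write_bytes(b"")
    (Path(tmp) / "shard-b.bin").write_bytes(b"")
    found_names = [p.name for p in find_bin_files_sketch(tmp)]

with tempfile.TemporaryDirectory() as empty:
    none_found = find_bin_files_sketch(empty)
```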

nemo_automodel.components.checkpoint.checkpointing._load_hf_checkpoint_preserving_dtype(
model_path: str,
weights_only: bool = True
) -> typing.Optional[dict[str, torch.Tensor]]

Load a HuggingFace checkpoint into a new state dict so tensor dtypes match the checkpoint (e.g. bf16). Used when loading the base model so FSDP sees uniform dtype instead of the model’s init dtypes (e.g. float32). Prefers safetensors but falls back to .bin files. Returns None if no loadable checkpoint is found.

Parameters:

model_path
str

Path to checkpoint file or directory.

weights_only
bool. Defaults to True

Forwarded to torch.load when loading .bin files.

nemo_automodel.components.checkpoint.checkpointing._load_hf_safetensors_checkpoint(
model_path: str
) -> typing.Optional[dict[str, torch.Tensor]]

Load a safetensors checkpoint into a state dict.

nemo_automodel.components.checkpoint.checkpointing._load_safetensors(
path: str
) -> dict[str, torch.Tensor]

Read a safetensors file from a local path or an msc:// cloud path.

nemo_automodel.components.checkpoint.checkpointing._materialize_to_hf_views_for_save(
state_dict: dict[str, torch.Tensor]
) -> None

Replace non-contiguous tensor values in state_dict with contiguous copies in place.

MoE adapters return non-contiguous strided views into the model’s grouped expert storage for the optimized load path; safetensors.torch.save (which the DCP HF storage writer calls) rejects non-contiguous tensors, so we materialize one tensor at a time here with empty_cache between iterations. Per-tensor transient is bounded to a single expert weight instead of allocating the full grouped set up front.

nemo_automodel.components.checkpoint.checkpointing._maybe_adapt_state_dict_from_hf(
model_part: torch.nn.Module,
state_dict: dict[str, torch.Tensor],
moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
) -> dict[str, torch.Tensor]

Custom models use state dict adapters to convert the state dict from the Hugging Face format to the native format.

nemo_automodel.components.checkpoint.checkpointing._maybe_adapt_state_dict_to_hf(
model_part: torch.nn.Module,
state_dict: dict[str, torch.Tensor],
quantization: bool = False,
**kwargs
) -> dict[str, torch.Tensor]

Custom models use state dict adapters to convert the state dict to the Hugging Face format.

nemo_automodel.components.checkpoint.checkpointing._maybe_msc_reader(
path: str,
storage_reader: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageReader]

Return an MSC filesystem reader for msc:// paths, else the given reader.

nemo_automodel.components.checkpoint.checkpointing._maybe_msc_writer(
path: str,
storage_writer: typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]
) -> typing.Optional[nemo_automodel.components.checkpoint._backports.hf_storage._HuggingFaceStorageWriter]

Return an MSC filesystem writer for msc:// paths, else the given writer.

nemo_automodel.components.checkpoint.checkpointing._model_has_dtensors(
module: torch.nn.Module
) -> bool

True if any parameter is a DTensor (model is already sharded).

nemo_automodel.components.checkpoint.checkpointing._normalize_dtype_mapping_to_state_dict_keys(
fqn_to_dtype_mapping: dict[str, str],
state_dict_keys: list[str],
base_model_prefix: str | None = None
) -> dict[str, str]

Align original HF dtype metadata with the keys that will be exported.

nemo_automodel.components.checkpoint.checkpointing._reinit_non_persistent_buffers(
model: torch.nn.Module,
device: torch.device,
model_type: str | None = None
) -> None

Recompute non-persistent buffers that are not saved in checkpoints.

Non-persistent buffers are not saved in checkpoints, so after meta-device materialization they contain uninitialized CUDA memory. When initialize_weights() is skipped (e.g. for Gemma3 to avoid DTensor issues), these buffers must be recomputed explicitly.

Only runs for models listed in _MODELS_REQUIRING_BUFFER_REINIT to avoid unexpected side-effects on arbitrary HF Hub models.

Handles four patterns:

  1. Standard RoPE: single inv_freq buffer with rope_init_fn + rope_kwargs (e.g. Nemotron-NAS).
  2. Per-layer-type RoPE: {layer_type}_inv_freq buffers via compute_default_rope_parameters (e.g. Gemma3RotaryEmbedding).
  3. Scaled embedding: embed_scale buffer on ScaledWordEmbedding modules (Gemma family), recomputed from scalar_embed_scale.
  4. Vision position IDs: position_ids buffer on vision embedding modules (SigLIP), recomputed from num_positions.

Parameters:

model
nn.Module

Model to reinitialize non-persistent buffers for.

device
torch.device

Device to create the new buffers on.

model_type
str | None. Defaults to None

The config.model_type string. If not in _MODELS_REQUIRING_BUFFER_REINIT the function is a no-op.

nemo_automodel.components.checkpoint.checkpointing._save_safetensors(
state_dict: dict[str, torch.Tensor],
path: str
) -> None

Write a safetensors file to a local path or an msc:// cloud path.

For cloud paths the tensors are serialized to bytes and streamed to the MSC file handle, since save_file only accepts a local filesystem path.

nemo_automodel.components.checkpoint.checkpointing._should_write_consolidated_safetensors(
config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
is_final_checkpoint: bool = False
) -> bool

Whether to output consolidated HF weights along with sharded weights.

nemo_automodel.components.checkpoint.checkpointing._should_write_hf_metadata(
config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> bool

Whether to write HF metadata/artifacts for a checkpoint.

nemo_automodel.components.checkpoint.checkpointing._summarize_state_dict_key_diff(
expected_keys: set[str],
loaded_keys: set[str],
limit: int = 10
) -> dict[str, typing.Any]

Summarize state-dict key mismatches for checkpoint load diagnostics.
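A key-diff summary of this kind usually reduces to two set differences with truncated samples. The sketch below assumes that shape; summarize_key_diff_sketch and the dictionary field names are hypothetical, since the actual return structure is not documented here.

```python
def summarize_key_diff_sketch(expected_keys, loaded_keys, limit=10):
    """Count missing/unexpected keys and keep up to `limit` sorted samples."""
    missing = sorted(expected_keys - loaded_keys)
    unexpected = sorted(loaded_keys - expected_keys)
    return {
        "num_missing": len(missing),
        "num_unexpected": len(unexpected),
        "missing_sample": missing[:limit],
        "unexpected_sample": unexpected[:limit],
    }


expected = {"model.embed.weight", "model.layers.0.weight", "lm_head.weight"}
loaded = {"model.embed.weight", "model.layers.0.weight", "extra.bias"}
summary = summarize_key_diff_sketch(expected, loaded, limit=1)
```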

nemo_automodel.components.checkpoint.checkpointing._warn_if_inline_consolidation_enabled(
config: nemo_automodel.components.checkpoint.config.CheckpointingConfig
) -> None

Educate users about the cost of inline HF consolidation.

nemo_automodel.components.checkpoint.checkpointing._warn_if_large_inline_consolidation(
config: nemo_automodel.components.checkpoint.config.CheckpointingConfig,
state_dict: dict[str, torch.Tensor],
fqn_to_index_mapping: typing.Optional[dict[str, int]],
is_final_checkpoint: bool = False
) -> None

Warn when inline consolidated export is large enough to waste GPU allocation time.

nemo_automodel.components.checkpoint.checkpointing.is_cloud_path(
path: str
) -> bool

Check if path is a cloud storage path (MSC).

nemo_automodel.components.checkpoint.checkpointing.save_config(
config: dict[str, typing.Any],
weights_path: str
) -> None

Save a config to a weights path.

Parameters:

config
dict[str, Any]

Config to save

weights_path
str

Path to save config

nemo_automodel.components.checkpoint.checkpointing.to_empty_parameters_only(
model: torch.nn.Module,
device: torch.device,
recurse: bool = True,
dtype: torch.dtype | None = None
) -> torch.nn.Module

Move parameters to the specified device without copying storage, skipping buffers.

Mirrors torch.nn.Module.to_empty but applies only to parameters, not buffers.

Parameters:

model
nn.Module

The module to transform

device
torch.device

Target device

recurse
bool. Defaults to True

Whether to recurse into child modules

Returns: nn.Module

The same module instance

nemo_automodel.components.checkpoint.checkpointing.MSC_AVAILABLE = True
nemo_automodel.components.checkpoint.checkpointing._CONSOLIDATED_SIZE_WARNING_THRESHOLD_BYTES = 50 * 1024 ** 3
nemo_automodel.components.checkpoint.checkpointing._DEFAULT_HF_CONSOLIDATED_SHARD_SIZE_BYTES = 5 * 1024 ** 3
nemo_automodel.components.checkpoint.checkpointing._MODELS_REQUIRING_BUFFER_REINIT: frozenset[str] = frozenset({'gemma3', 'nemotron-nas'})
nemo_automodel.components.checkpoint.checkpointing.logger = logging.getLogger(__name__)
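To make the two size constants concrete, the sketch below evaluates them and estimates how a given export size compares. It is a worked-arithmetic illustration only: describe_export_sketch is a hypothetical helper, the strict greater-than comparison is an assumption, and the byte counts assume 2 bytes per parameter (bf16).

```python
WARNING_THRESHOLD_BYTES = 50 * 1024 ** 3   # 50 GiB = 53,687,091,200 bytes
DEFAULT_SHARD_SIZE_BYTES = 5 * 1024 ** 3   # 5 GiB  =  5,368,709,120 bytes


def describe_export_sketch(total_bytes):
    """Estimate shard count and whether the size exceeds the warning threshold."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    shards = max(1, -(-total_bytes // DEFAULT_SHARD_SIZE_BYTES))
    return {
        "exceeds_warning_threshold": total_bytes > WARNING_THRESHOLD_BYTES,
        "estimated_shards": shards,
    }


small = describe_export_sketch(8_000_000_000 * 2)    # roughly an 8B-parameter model
large = describe_export_sketch(70_000_000_000 * 2)   # roughly a 70B-parameter model
```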