nemo_automodel.components.checkpoint.utils

Module Contents

Functions

| Name | Description |
| --- | --- |
| _get_checkpoint_tensor_dtypes | Inspect checkpoint tensors and return their exact dtypes by key. |
| _get_module_by_normalized_name | Return a module by FQN after applying wrapper-prefix normalization. |
| _normalize_param_name | Strip wrapper-specific prefixes from a parameter name. |
| _same_tensor_storage | Return whether two tensors are aliases of the same local storage. |
| ensure_tied_lm_head | Ensure a local tied LM head actually aliases the input embedding. |
| estimate_state_dict_bytes | Estimate logical bytes in a state dict without materializing tensors. |
| estimate_tensor_bytes | Estimate logical bytes in a tensor without materializing it. |
| format_bytes | Format bytes as a human-readable GiB value. |
| format_output_file_count | Format the output shard count for user-facing log messages. |
| get_input_embeddings_weight_and_name | Return the input embedding weight and normalized name if present. |
| get_lm_head_weight_and_name | Return the first lm_head.weight parameter found on a model. |
| get_rank_safe | Return the current distributed rank, defaulting to 0 when not initialized. |
| get_safetensors_index_total_size | Return the total checkpoint size recorded in a Hugging Face safetensors index. |
| get_tied_lm_head_source_names | Return candidate checkpoint keys that can source a tied LM head. |
| get_world_size_safe | Return the current distributed world size, defaulting to 1 when not initialized. |
| has_local_tied_lm_head | Return whether the current model partition has an actual tied LM head. |
| is_rank_0 | Return True on the main rank. |
| is_tied_word_embeddings | Check if the model’s word embeddings are tied. |
| materialize_missing_tied_lm_head | Populate a missing tied lm_head.weight from its embedding source. |
| resolve_trust_remote_code | Whitelist NVIDIA models to allow remote code execution. |

API

nemo_automodel.components.checkpoint.utils._get_checkpoint_tensor_dtypes(
pretrained_model_name_or_path: str,
hf_config: typing.Any,
load_kwargs: collections.abc.Mapping[str, object] | None = None
) -> dict[str, torch.dtype]

Inspect checkpoint tensors and return their exact dtypes by key.

This reads only checkpoint metadata by loading tensors on the meta device, which preserves the per-tensor dtype information without materializing the full checkpoint weights in memory.
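The function itself works through the meta device. As a point of comparison, a different and library-free way to see per-tensor dtypes without loading weights is to read the JSON header of a single `.safetensors` file. The sketch below is not how this module does it; `read_safetensors_dtypes` is a hypothetical helper, and it returns the dtype strings stored in the file (for example `BF16`) rather than `torch.dtype` objects. It relies only on the published safetensors layout: an 8-byte little-endian header length followed by a JSON header.

```python
import json
import struct


def read_safetensors_dtypes(path: str) -> dict[str, str]:
    """Return {tensor_name: dtype_string} from a .safetensors header.

    Hypothetical helper for illustration. Only the header is read, so no
    tensor data is loaded into memory.
    """
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"
    }
```

Sharded checkpoints would need this applied to every shard file, which the sketch leaves out.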

nemo_automodel.components.checkpoint.utils._get_module_by_normalized_name(
model: torch.nn.Module,
normalized_module_name: str
) -> torch.nn.Module | None

Return a module by FQN after applying wrapper-prefix normalization.

nemo_automodel.components.checkpoint.utils._normalize_param_name(
name: str
) -> str

Strip wrapper-specific prefixes from a parameter name.
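The exact prefixes this function strips are not listed here. As a rough sketch of the idea, the example below drops a few attribute names that common PyTorch wrappers insert into parameter FQNs. The segment set and the name `normalize_param_name_sketch` are assumptions for illustration, not the library's implementation.

```python
# Assumed wrapper segments; the set the library actually strips may differ.
WRAPPER_SEGMENTS = {
    "module",                      # DistributedDataParallel / DataParallel
    "_orig_mod",                   # torch.compile
    "_checkpoint_wrapped_module",  # activation checkpoint wrapper
}


def normalize_param_name_sketch(name: str) -> str:
    """Remove assumed wrapper segments from a dotted parameter name."""
    parts = name.split(".")
    return ".".join(part for part in parts if part not in WRAPPER_SEGMENTS)
```

Matching whole dotted segments, rather than substrings, avoids mangling names such as `submodule.weight`. A model with a genuine attribute called `module` would still be affected, which is one reason to treat this as a sketch only.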

nemo_automodel.components.checkpoint.utils._same_tensor_storage(
left: torch.Tensor,
right: torch.Tensor
) -> bool

Return whether two tensors are aliases of the same local storage.

nemo_automodel.components.checkpoint.utils.ensure_tied_lm_head(
model: torch.nn.Module
) -> bool

Ensure a local tied LM head actually aliases the input embedding.

Hugging Face tie_weights() is the first choice because model classes can have custom tying rules. The direct assignment fallback handles wrapped models whose generic tie_weights() no longer reaches the local lm_head/embedding pair after sharding.

Parameters:

model
nn.Module

Model or pipeline stage to inspect and update.

Returns: bool

True if the local lm_head and input embedding are tied after the call, False otherwise.

nemo_automodel.components.checkpoint.utils.estimate_state_dict_bytes(
state_dict: dict[str, torch.Tensor]
) -> int | None

Estimate logical bytes in a state dict without materializing tensors.

nemo_automodel.components.checkpoint.utils.estimate_tensor_bytes(
tensor: torch.Tensor
) -> int

Estimate logical bytes in a tensor without materializing it.
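One plausible reading of "logical bytes" is element count times bytes per element, both of which are known from shape and dtype alone. The sketch below shows that arithmetic in plain Python; the helper names and the small dtype table are assumptions for illustration. With a real tensor the same quantity is `tensor.numel() * tensor.element_size()`.

```python
import math

# Bytes per element for a few common dtypes (illustrative subset).
ITEMSIZE = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_bytes_sketch(shape: tuple, dtype_name: str) -> int:
    """Element count times element size; no tensor data is needed."""
    return math.prod(shape) * ITEMSIZE[dtype_name]


def estimate_total_bytes_sketch(specs: dict) -> int:
    """Sum the estimate over a {name: (shape, dtype_name)} mapping."""
    return sum(estimate_bytes_sketch(shape, dtype) for shape, dtype in specs.values())
```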

nemo_automodel.components.checkpoint.utils.format_bytes(
num_bytes: int
) -> str

Format bytes as a human-readable GiB value.
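A minimal sketch of such a formatter, assuming two decimal places and a `GiB` suffix (the actual precision and suffix used by the function are not stated here):

```python
def format_gib_sketch(num_bytes: int) -> str:
    """Render a byte count in GiB, where 1 GiB = 2**30 bytes."""
    return f"{num_bytes / 2**30:.2f} GiB"
```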

nemo_automodel.components.checkpoint.utils.format_output_file_count(
count: int
) -> str

Format the output shard count for user-facing log messages.

nemo_automodel.components.checkpoint.utils.get_input_embeddings_weight_and_name(
model: torch.nn.Module
) -> tuple[torch.Tensor | None, str | None]

Return the input embedding weight and normalized name if present.

Parameters:

model
nn.Module

Model to inspect.

Returns: tuple[torch.Tensor | None, str | None]

Tuple of the embedding weight tensor and its normalized FQN, or (None, None) if the model has no input embedding.

nemo_automodel.components.checkpoint.utils.get_lm_head_weight_and_name(
model: torch.nn.Module
) -> tuple[torch.Tensor | None, str | None]

Return the first lm_head.weight parameter found on a model.

Parameters:

model
nn.Module

Model to inspect.

Returns: tuple[torch.Tensor | None, str | None]

Tuple of the parameter tensor and its normalized FQN, or (None, None) if no lm_head.weight parameter is found.

nemo_automodel.components.checkpoint.utils.get_rank_safe() -> int

Return the current distributed rank, defaulting to 0 when not initialized.

nemo_automodel.components.checkpoint.utils.get_safetensors_index_total_size(
index_path: str | None
) -> int | None

Return the total checkpoint size recorded in a Hugging Face safetensors index.
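A Hugging Face safetensors index file (typically `model.safetensors.index.json`) stores the checkpoint size under `metadata.total_size`, next to the `weight_map`. The sketch below reads that field; `read_index_total_size` is a hypothetical name, and returning None for a missing path or missing field is an assumption based on the `int | None` return type.

```python
import json
import os
from typing import Optional


def read_index_total_size(index_path: Optional[str]) -> Optional[int]:
    """Return metadata.total_size from a safetensors index, or None."""
    if not index_path or not os.path.isfile(index_path):
        return None
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    total = index.get("metadata", {}).get("total_size")
    return int(total) if total is not None else None
```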

nemo_automodel.components.checkpoint.utils.get_tied_lm_head_source_names(
model: torch.nn.Module,
lm_head_param_name: str | None = None
) -> list[str]

Return candidate checkpoint keys that can source a tied LM head.

Parameters:

model
nn.Module

Model or pipeline stage to inspect.

lm_head_param_name
str | None, defaults to None

Optional normalized LM head FQN.

Returns: list[str]

Ordered list of possible source FQNs.

nemo_automodel.components.checkpoint.utils.get_world_size_safe() -> int

Return the current distributed world size, defaulting to 1 when not initialized.
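The "safe" behavior described for get_rank_safe() and get_world_size_safe() can be sketched with the standard `torch.distributed` checks. The function names below are hypothetical, and the library's own implementation may differ in detail.

```python
import torch.distributed as dist


def rank_or_zero() -> int:
    """Current rank, or 0 when no process group has been initialized."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


def world_size_or_one() -> int:
    """Current world size, or 1 when no process group has been initialized."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1
```

Guarding on `is_initialized()` lets the same code path run in single-process scripts and unit tests, where calling `dist.get_rank()` directly would raise.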

nemo_automodel.components.checkpoint.utils.has_local_tied_lm_head(
model: torch.nn.Module
) -> bool

Return whether the current model partition has an actual tied LM head.

This is stricter than is_tied_word_embeddings(): pipeline stages often keep the config flag set to True even when lm_head and embed_tokens live on different partitions. Some custom models can also declare tied embeddings in config without actually aliasing the parameters. In that case omitting lm_head.weight from a checkpoint loses trained state, so only treat it as safely tied when the local tensors share storage.

Parameters:

model
nn.Module

Model or pipeline stage to inspect.

Returns: bool

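True when the model is configured with tied word embeddings, both the lm_head and input embedding weights are present on the local partition, and the two tensors share storage.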
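To make the distinction between "configured as tied" and "actually aliased" concrete, here is a small sketch with a toy module. `TinyLM` and `shares_storage` are hypothetical stand-ins, and comparing `data_ptr()` values is one simple way to detect aliasing, not necessarily the check this module uses.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy stand-in for a causal LM; not a real Hugging Face model."""

    def __init__(self, vocab: int = 8, hidden: int = 4) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)


def shares_storage(left: torch.Tensor, right: torch.Tensor) -> bool:
    """True when both tensors start at the same memory address."""
    return left.data_ptr() == right.data_ptr()


model = TinyLM()
# Freshly constructed modules own separate weights, whatever a config says.
untied = shares_storage(model.lm_head.weight, model.embed_tokens.weight)

# Direct assignment makes both modules hold the same Parameter object.
model.lm_head.weight = model.embed_tokens.weight
tied = shares_storage(model.lm_head.weight, model.embed_tokens.weight)
```

Only in the second state is it safe to drop lm_head.weight from a checkpoint, because the embedding entry then carries the same values.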

nemo_automodel.components.checkpoint.utils.is_rank_0() -> bool

Return True on the main rank.

nemo_automodel.components.checkpoint.utils.is_tied_word_embeddings(
model: torch.nn.Module
) -> bool

Check if the model’s word embeddings are tied.

Parameters:

model
nn.Module

The model to check.

Returns: bool

True if the model’s word embeddings are tied, False otherwise.

nemo_automodel.components.checkpoint.utils.materialize_missing_tied_lm_head(
state_dict: dict[str, typing.Any],
model: torch.nn.Module,
allow_current_lm_head_fallback: bool = False
) -> bool

Populate a missing tied lm_head.weight from its embedding source.

Hugging Face checkpoints for tied-embedding models often omit lm_head.weight entirely. That is fine for unsplit models, where tie_weights() can restore the alias, but it breaks pipeline-parallel last stages, which own lm_head but not embed_tokens.

Parameters:

state_dict
dict[str, Any]

Checkpoint state dict to mutate in place.

model
nn.Module

Target model or pipeline stage.

allow_current_lm_head_fallback
bool, defaults to False

If True, fall back to the current lm_head tensor when the tied source cannot be found in state_dict. This preserves legacy resume behavior for older checkpoints that were saved without a local lm_head.weight.

Returns: bool

True if a missing lm_head.weight was materialized, else False.
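The core idea can be sketched on a plain dictionary: if the LM head key is absent, copy the entry from the first candidate source key that is present. The helper below is a simplified assumption, not the library's code. It takes the candidate names as an argument instead of deriving them from the model, omits the allow_current_lm_head_fallback path, and reuses the source object rather than cloning it.

```python
from typing import Any


def materialize_lm_head_sketch(
    state_dict: dict[str, Any],
    source_names: list[str],
    lm_head_name: str = "lm_head.weight",
) -> bool:
    """Fill a missing LM head entry from the first available source key."""
    if lm_head_name in state_dict:
        return False  # Nothing is missing, so leave the dict untouched.
    for name in source_names:
        if name in state_dict:
            state_dict[lm_head_name] = state_dict[name]
            return True
    return False  # No candidate source was found.
```

For example, with a checkpoint that only contains `model.embed_tokens.weight`, calling the sketch with that name as the single candidate adds an `lm_head.weight` entry and returns True.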

nemo_automodel.components.checkpoint.utils.resolve_trust_remote_code(
pretrained_model_name_or_path
)

Whitelist NVIDIA models to allow remote code execution.

Parameters:

pretrained_model_name_or_path
str

The name or path of the pretrained model.

Returns:

True if the model should be loaded with trust_remote_code, False otherwise.
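The matching rule is not spelled out here. As a purely illustrative sketch, an allowlist check could compare the model identifier against trusted organization prefixes. The `nvidia/` prefix, the function name, and the model identifiers in the usage note are all assumptions, not the library's actual rule.

```python
def trust_remote_code_sketch(
    name_or_path: str,
    allowed_prefixes: tuple = ("nvidia/",),  # assumed allowlist
) -> bool:
    """Return True only for identifiers under an allowed prefix."""
    return name_or_path.startswith(tuple(allowed_prefixes))
```

Under these assumptions, `trust_remote_code_sketch("nvidia/some-model")` is True and `trust_remote_code_sketch("other-org/some-model")` is False.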