nemo_automodel.components.checkpoint.utils

Module Contents

Functions

Name	Description
`_get_checkpoint_tensor_dtypes`	Inspect checkpoint tensors and return their exact dtypes by key.
`_get_module_by_normalized_name`	Return a module by FQN after applying wrapper-prefix normalization.
`_normalize_param_name`	Strip wrapper-specific prefixes from a parameter name.
`_same_tensor_storage`	Return whether two tensors are aliases of the same local storage.
`ensure_tied_lm_head`	Ensure a local tied LM head actually aliases the input embedding.
`estimate_state_dict_bytes`	Estimate logical bytes in a state dict without materializing tensors.
`estimate_tensor_bytes`	Estimate logical bytes in a tensor without materializing it.
`format_bytes`	Format bytes as a human-readable GiB value.
`format_output_file_count`	Format the output shard count for user-facing log messages.
`get_input_embeddings_weight_and_name`	Return the input embedding weight and normalized name if present.
`get_lm_head_weight_and_name`	Return the first `lm_head.weight` parameter found on a model.
`get_rank_safe`	Return the current distributed rank, defaulting to 0 when not initialized.
`get_safetensors_index_total_size`	Return the total checkpoint size recorded in a Hugging Face safetensors index.
`get_tied_lm_head_source_names`	Return candidate checkpoint keys that can source a tied LM head.
`get_world_size_safe`	Return the current distributed world size, defaulting to 1 when not initialized.
`has_local_tied_lm_head`	Return whether the current model partition has an actual tied LM head.
`is_rank_0`	Return True on the main rank.
`is_tied_word_embeddings`	Check if the model’s word embeddings are tied.
`materialize_missing_tied_lm_head`	Populate a missing tied `lm_head.weight` from its embedding source.
`resolve_trust_remote_code`	Whitelist NVIDIA models to allow remote code execution.

API

nemo_automodel.components.checkpoint.utils._get_checkpoint_tensor_dtypes(
    pretrained_model_name_or_path: str,
    hf_config: typing.Any,
    load_kwargs: collections.abc.Mapping[str, object] | None = None
) -> dict[str, torch.dtype]

Inspect checkpoint tensors and return their exact dtypes by key.

This reads checkpoint metadata only by loading tensors on the meta device, so it preserves the per-tensor dtype information without materializing full checkpoint weights in memory.
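The meta-device approach described above is not reproduced here. As a related illustration of reading per-tensor dtypes without loading any weights, the sketch below parses the JSON header of a single `.safetensors` file with the standard library only. The helper name `read_safetensors_dtypes` and the hand-written demo file are hypothetical, and the result holds dtype strings rather than `torch.dtype` objects.

```python
import json
import os
import struct
import tempfile
from typing import Dict


def read_safetensors_dtypes(path: str) -> Dict[str, str]:
    """Hypothetical helper: map tensor names to dtype strings from the file header."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as an unsigned little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}


# Build a tiny file by hand so the sketch runs without a real checkpoint.
header = {
    "lm_head.weight": {"dtype": "BF16", "shape": [2, 2], "data_offsets": [0, 8]},
    "norm.weight": {"dtype": "F32", "shape": [2], "data_offsets": [8, 16]},
}
header_bytes = json.dumps(header).encode("utf-8")
demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)))
    f.write(header_bytes)
    f.write(bytes(16))  # zero-filled tensor payload

dtypes = read_safetensors_dtypes(demo_path)
print(dtypes)
```

Only the header is read, so the cost does not depend on the size of the tensor payload.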

nemo_automodel.components.checkpoint.utils._get_module_by_normalized_name(
    model: torch.nn.Module,
    normalized_module_name: str
) -> torch.nn.Module | None

Return a module by FQN after applying wrapper-prefix normalization.

nemo_automodel.components.checkpoint.utils._normalize_param_name(
    name: str
) -> str

Strip wrapper-specific prefixes from a parameter name.
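A minimal sketch of this kind of normalization, assuming the wrappers in question are the common PyTorch ones. The set of names below and the function name `normalize_param_name_sketch` are assumptions; the real function may strip a different set or only leading prefixes.

```python
# Assumed wrapper segment names; not taken from the module's source.
WRAPPER_SEGMENTS = {
    "module",                      # DistributedDataParallel
    "_orig_mod",                   # torch.compile
    "_checkpoint_wrapped_module",  # activation checkpointing wrapper
    "_fsdp_wrapped_module",        # FSDP
}


def normalize_param_name_sketch(name: str) -> str:
    """Drop every dotted segment that matches an assumed wrapper name."""
    return ".".join(part for part in name.split(".") if part not in WRAPPER_SEGMENTS)


wrapped = "_orig_mod.module.model.layers.0._checkpoint_wrapped_module.mlp.weight"
print(normalize_param_name_sketch(wrapped))
```

Matching whole segments avoids damaging names that merely contain a wrapper name as a substring, at the cost of also dropping a genuine submodule that happens to be called `module`.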

nemo_automodel.components.checkpoint.utils._same_tensor_storage(
    left: torch.Tensor,
    right: torch.Tensor
) -> bool

Return whether two tensors are aliases of the same local storage.
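One way such an alias check could look is sketched below. Comparing `data_ptr()` and shape is an assumption about the technique, not the function's confirmed implementation, and `same_storage_sketch` is a hypothetical name.

```python
import torch


def same_storage_sketch(left: torch.Tensor, right: torch.Tensor) -> bool:
    """Assumed check: both tensors start at the same address and have the same shape."""
    return left.data_ptr() == right.data_ptr() and left.shape == right.shape


embedding = torch.nn.Embedding(8, 4)
alias = embedding.weight                   # the same Parameter object
clone = embedding.weight.detach().clone()  # equal values, separate memory

is_alias = same_storage_sketch(embedding.weight, alias)
is_clone_alias = same_storage_sketch(embedding.weight, clone)
print(is_alias, is_clone_alias)
```

Value equality is deliberately not used: a cloned tensor compares equal element by element but is not tied.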

nemo_automodel.components.checkpoint.utils.ensure_tied_lm_head(
    model: torch.nn.Module
) -> bool

Ensure a local tied LM head actually aliases the input embedding.

Hugging Face `tie_weights()` is the first choice because model classes can have custom tying rules. The direct assignment fallback handles wrapped models whose generic `tie_weights()` no longer reaches the local `lm_head`/embedding pair after sharding.

Parameters:

model

nn.Module

Model or pipeline stage to inspect and update.

Returns: bool

True if the local lm_head and input embedding are tied after the
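The two-step strategy described for `ensure_tied_lm_head` (try `tie_weights()`, then assign directly) can be sketched on a toy module. `TinyLM`, its attribute names `embed_tokens` and `lm_head`, and `tie_lm_head_sketch` are hypothetical stand-ins, and the logic is an illustration rather than the library's code.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for a causal LM with an embedding and an LM head."""

    def __init__(self, vocab: int = 16, hidden: int = 4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)


def tie_lm_head_sketch(model: nn.Module) -> bool:
    """Prefer tie_weights() when the model defines it, then fall back to assignment."""
    tie = getattr(model, "tie_weights", None)
    if callable(tie):
        tie()
    if model.lm_head.weight is not model.embed_tokens.weight:
        # Direct assignment registers the embedding Parameter on lm_head as well.
        model.lm_head.weight = model.embed_tokens.weight
    return model.lm_head.weight is model.embed_tokens.weight


model = TinyLM()
tied = tie_lm_head_sketch(model)
num_unique_params = len(list(model.parameters()))
print(tied, num_unique_params)
```

After tying, `model.parameters()` yields the shared parameter once, which is a quick way to confirm that the two modules hold the same object.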

nemo_automodel.components.checkpoint.utils.estimate_state_dict_bytes(
    state_dict: dict[str, torch.Tensor]
) -> int | None

Estimate logical bytes in a state dict without materializing tensors.

nemo_automodel.components.checkpoint.utils.estimate_tensor_bytes(
    tensor: torch.Tensor
) -> int

Estimate logical bytes in a tensor without materializing it.
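A minimal sketch of both estimates, assuming "logical bytes" means element count times element size. The function names are hypothetical, and the real `estimate_state_dict_bytes` can also return `None` under conditions this page does not describe.

```python
import torch


def tensor_bytes_sketch(tensor: torch.Tensor) -> int:
    """Element count times bytes per element; no tensor data is touched."""
    return tensor.numel() * tensor.element_size()


def state_dict_bytes_sketch(state_dict: dict) -> int:
    """Sum the per-tensor estimates, skipping non-tensor entries."""
    return sum(
        tensor_bytes_sketch(v) for v in state_dict.values() if isinstance(v, torch.Tensor)
    )


# Meta tensors carry shape and dtype but allocate no memory.
state = {
    "a": torch.empty(1024, 1024, dtype=torch.bfloat16, device="meta"),
    "b": torch.empty(1024, dtype=torch.float32, device="meta"),
}
total_bytes = state_dict_bytes_sketch(state)
print(total_bytes)
```

Here the bfloat16 matrix contributes 1024 * 1024 * 2 bytes and the float32 vector 1024 * 4 bytes.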

nemo_automodel.components.checkpoint.utils.format_bytes(
    num_bytes: int
) -> str

Format bytes as a human-readable GiB value.
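For illustration, a GiB formatter could look like the sketch below. The two-decimal precision and the exact output string are assumptions, as is the name `format_gib_sketch`.

```python
def format_gib_sketch(num_bytes: int) -> str:
    """Assumed format: two decimals followed by the GiB unit (1 GiB = 1024**3 bytes)."""
    return f"{num_bytes / 1024**3:.2f} GiB"


formatted = format_gib_sketch(3 * 1024**3 + 512 * 1024**2)
print(formatted)  # 3.50 GiB
```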

nemo_automodel.components.checkpoint.utils.format_output_file_count(
    count: int
) -> str

Format the output shard count for user-facing log messages.

nemo_automodel.components.checkpoint.utils.get_input_embeddings_weight_and_name(
    model: torch.nn.Module
) -> tuple[torch.Tensor | None, str | None]

Return the input embedding weight and normalized name if present.

Parameters:

model

nn.Module

Model to inspect.

Returns: tuple[torch.Tensor | None, str | None]

Tuple of the embedding weight tensor and its normalized FQN, or

nemo_automodel.components.checkpoint.utils.get_lm_head_weight_and_name(
    model: torch.nn.Module
) -> tuple[torch.Tensor | None, str | None]

Return the first `lm_head.weight` parameter found on a model.

Parameters:

model

nn.Module

Model to inspect.

Returns: tuple[torch.Tensor | None, str | None]

Tuple of the parameter tensor and its normalized FQN, or (None, None)

nemo_automodel.components.checkpoint.utils.get_rank_safe() -> int

Return the current distributed rank, defaulting to 0 when not initialized.

nemo_automodel.components.checkpoint.utils.get_safetensors_index_total_size(
    index_path: str | None
) -> int | None

Return the total checkpoint size recorded in a Hugging Face safetensors index.
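A Hugging Face safetensors index (typically `model.safetensors.index.json`) stores the size under `metadata.total_size`. The sketch below reads that field with the standard library; the helper name `read_index_total_size` and its handling of missing files are assumptions, not the function's confirmed behavior.

```python
import json
import os
import tempfile
from typing import Optional


def read_index_total_size(index_path: Optional[str]) -> Optional[int]:
    """Hypothetical helper: return metadata.total_size, or None if it is unavailable."""
    if index_path is None or not os.path.isfile(index_path):
        return None
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    total = index.get("metadata", {}).get("total_size")
    return int(total) if total is not None else None


# Write a minimal index so the sketch runs without a real checkpoint.
demo_index = {
    "metadata": {"total_size": 16},
    "weight_map": {"lm_head.weight": "model-00001-of-00001.safetensors"},
}
index_path = os.path.join(tempfile.mkdtemp(), "model.safetensors.index.json")
with open(index_path, "w", encoding="utf-8") as f:
    json.dump(demo_index, f)

size = read_index_total_size(index_path)
print(size)
```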

nemo_automodel.components.checkpoint.utils.get_tied_lm_head_source_names(
    model: torch.nn.Module,
    lm_head_param_name: str | None = None
) -> list[str]

Return candidate checkpoint keys that can source a tied LM head.

Parameters:

model

nn.Module

Model or pipeline stage to inspect.

lm_head_param_name

str | None, defaults to None

Optional normalized LM head FQN.

Returns: list[str]

Ordered list of possible source FQNs.

nemo_automodel.components.checkpoint.utils.get_world_size_safe() -> int

Return the current distributed world size, defaulting to 1 when not initialized.

nemo_automodel.components.checkpoint.utils.has_local_tied_lm_head(
    model: torch.nn.Module
) -> bool

Return whether the current model partition has an actual tied LM head.

This is stricter than `is_tied_word_embeddings()`: pipeline stages often keep the config flag set to True even when `lm_head` and `embed_tokens` live on different partitions. Some custom models also declare tied embeddings in their config without actually aliasing the parameters. In that case, omitting `lm_head.weight` from a checkpoint loses trained state, so the head is treated as safely tied only when the local tensors share storage.

Parameters:

model

nn.Module

Model or pipeline stage to inspect.

Returns: bool

True when the model is configured with tied word embeddings, both
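The difference between a config flag and an actual alias can be shown on a toy module. Everything in this sketch is hypothetical: the `TinyLM` class, the `embed_tokens` and `lm_head` attribute names, and the `data_ptr()` comparison standing in for the storage check.

```python
from types import SimpleNamespace

import torch
from torch import nn


class TinyLM(nn.Module):
    """Hypothetical model whose config flag may or may not match reality."""

    def __init__(self, tie_flag: bool, really_tie: bool):
        super().__init__()
        self.config = SimpleNamespace(tie_word_embeddings=tie_flag)
        self.embed_tokens = nn.Embedding(16, 4)
        self.lm_head = nn.Linear(4, 16, bias=False)
        if really_tie:
            self.lm_head.weight = self.embed_tokens.weight


def has_local_tie_sketch(model: nn.Module) -> bool:
    """True only if the flag is set, both modules are local, and the weights alias."""
    flag = getattr(getattr(model, "config", None), "tie_word_embeddings", False)
    head = getattr(model, "lm_head", None)
    embed = getattr(model, "embed_tokens", None)
    if not flag or head is None or embed is None:
        return False
    return head.weight.data_ptr() == embed.weight.data_ptr()


flag_and_alias = has_local_tie_sketch(TinyLM(tie_flag=True, really_tie=True))
flag_only = has_local_tie_sketch(TinyLM(tie_flag=True, really_tie=False))
print(flag_and_alias, flag_only)
```

The second model is the case the description warns about: the flag says tied, but dropping its `lm_head.weight` from a checkpoint would discard an independent tensor.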

nemo_automodel.components.checkpoint.utils.is_rank_0() -> bool

Return True on the main rank.
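The rank and world-size helpers on this page all follow the same "safe default" idea. A minimal sketch using `torch.distributed` is shown below; the function names are hypothetical, and treating rank 0 as the main rank is an assumption.

```python
import torch.distributed as dist


def rank_or_zero() -> int:
    """Rank of this process, or 0 when no process group is initialized."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


def world_size_or_one() -> int:
    """Number of processes, or 1 when no process group is initialized."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1


def is_main_rank() -> bool:
    """Assumed definition: the main rank is rank 0."""
    return rank_or_zero() == 0


# In a plain single-process script no process group exists, so the defaults apply.
print(rank_or_zero(), world_size_or_one(), is_main_rank())
```

Guarding with `is_available()` first keeps the helpers usable on PyTorch builds compiled without distributed support.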

nemo_automodel.components.checkpoint.utils.is_tied_word_embeddings(
    model: torch.nn.Module
) -> bool

Check if the model’s word embeddings are tied.

Parameters:

model

nn.Module

The model to check.

Returns: bool

True if the model’s word embeddings are tied, False otherwise.

nemo_automodel.components.checkpoint.utils.materialize_missing_tied_lm_head(
    state_dict: dict[str, typing.Any],
    model: torch.nn.Module,
    allow_current_lm_head_fallback: bool = False
) -> bool

Populate a missing tied `lm_head.weight` from its embedding source.

Hugging Face checkpoints for tied-embedding models often omit `lm_head.weight` entirely. That is fine for unsplit models, where `tie_weights()` can restore the alias, but it breaks pipeline-parallel last stages, which own `lm_head` but not `embed_tokens`.

Parameters:

state_dict

dict[str, Any]

Checkpoint state dict to mutate in place.

model

nn.Module

Target model or pipeline stage.

allow_current_lm_head_fallback

bool, defaults to False

If True, fall back to the current `lm_head` tensor when the tied source cannot be found in `state_dict`. This preserves legacy resume behavior for older checkpoints that were saved without a local `lm_head.weight`.

Returns: bool

True if a missing lm_head.weight was materialized, else False.
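The core idea, filling a missing LM head entry from a tied source key, can be sketched on a plain dictionary. This simplified version takes the candidate source names as an argument instead of inspecting a model, omits the `allow_current_lm_head_fallback` path, and uses assumed key names; `materialize_lm_head_sketch` is a hypothetical name.

```python
from typing import Any, Dict, List


def materialize_lm_head_sketch(
    state_dict: Dict[str, Any],
    source_names: List[str],
    lm_head_name: str = "lm_head.weight",
) -> bool:
    """Copy the first available tied source into a missing LM head entry, in place."""
    if lm_head_name in state_dict:
        return False  # nothing is missing
    for name in source_names:
        if name in state_dict:
            state_dict[lm_head_name] = state_dict[name]
            return True
    return False  # no candidate source was present


# Placeholder values stand in for tensors so the sketch has no dependencies.
checkpoint = {"model.embed_tokens.weight": [[0.1, 0.2], [0.3, 0.4]]}
candidates = ["embed_tokens.weight", "model.embed_tokens.weight"]

first_call = materialize_lm_head_sketch(checkpoint, candidates)
second_call = materialize_lm_head_sketch(checkpoint, candidates)
print(first_call, second_call, sorted(checkpoint))
```

The second call returns False because the entry already exists, which mirrors the documented return value: True only when something was actually materialized.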

nemo_automodel.components.checkpoint.utils.resolve_trust_remote_code(
    pretrained_model_name_or_path
)

Whitelist NVIDIA models to allow remote code execution.

Parameters:

pretrained_model_name_or_path

str

The name or path of the pretrained model.

Returns:

True if the model should be loaded with trust_remote_code, False otherwise.