nemo_automodel.components.checkpoint.utils
Module Contents
Functions
API
Inspect checkpoint tensors and return their exact dtypes by key.
It reads only checkpoint metadata by loading tensors onto the meta
device. This preserves the exact per-tensor dtype information without
materializing the full checkpoint weights in memory.
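The same goal, reading dtypes without touching weight data, can be illustrated with the safetensors format. A safetensors file begins with an 8-byte little-endian header length, followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. The sketch below parses that header directly instead of going through the meta device, so it swaps in a different technique and is not this module's implementation. The name `read_safetensors_dtypes` is hypothetical.

```python
import io
import json
import struct
from typing import BinaryIO


def read_safetensors_dtypes(f: BinaryIO) -> dict[str, str]:
    """Hypothetical helper: map tensor name -> safetensors dtype string.

    Only the header is read; tensor payload bytes are never loaded.
    """
    # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", f.read(8))
    # Next header_len bytes: UTF-8 JSON describing every tensor.
    header = json.loads(f.read(header_len))
    return {
        name: info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"  # reserved key for free-form string metadata
    }


if __name__ == "__main__":
    # Tiny in-memory file: one 2x2 BF16 tensor (8 bytes of payload).
    hdr = json.dumps({
        "w": {"dtype": "BF16", "shape": [2, 2], "data_offsets": [0, 8]},
        "__metadata__": {"format": "pt"},
    }).encode()
    buf = io.BytesIO(struct.pack("<Q", len(hdr)) + hdr + bytes(8))
    print(read_safetensors_dtypes(buf))  # {'w': 'BF16'}
```

For a real file, open it in binary mode (`open(path, "rb")`) and pass the handle.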
Return a module by FQN after applying wrapper-prefix normalization.
Strip wrapper-specific prefixes from a parameter name.
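Here is a minimal sketch of prefix stripping. It assumes the wrappers in play are `torch.compile` (`_orig_mod`), FSDP1 (`_fsdp_wrapped_module`), activation checkpointing (`_checkpoint_wrapped_module`), and DDP (a single leading `module.`). The segment list and the name `strip_wrapper_prefixes` are assumptions, and the real helper may handle a different set.

```python
# Hypothetical set of wrapper path segments; the real list may differ.
_WRAPPER_SEGMENTS = ("_orig_mod", "_fsdp_wrapped_module", "_checkpoint_wrapped_module")


def strip_wrapper_prefixes(name: str) -> str:
    """Hypothetical helper: drop wrapper segments from a parameter FQN."""
    # Wrapper segments can appear anywhere (e.g. per-layer checkpoint wrappers).
    parts = [p for p in name.split(".") if p not in _WRAPPER_SEGMENTS]
    # DDP/DataParallel add exactly one leading "module." segment; only strip it
    # at the front so a genuine submodule named "module" deeper in is kept.
    if parts and parts[0] == "module":
        parts = parts[1:]
    return ".".join(parts)
```

Filtering the wrapper segments before checking for `module` lets nested wrappers such as `_orig_mod.module.` normalize correctly.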
Return whether two tensors are aliases of the same local storage.
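One way to test storage aliasing is to compare the base pointers of the two tensors' untyped storages. With this check, views and slices of the same buffer also count as aliases. The sketch assumes both arguments are plain local tensors; a DTensor would first need `.to_local()`. The name `shares_storage` is hypothetical.

```python
import torch


def shares_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Hypothetical helper: True if a and b view the same local storage."""
    if a.device != b.device:
        return False
    pa = a.untyped_storage().data_ptr()
    pb = b.untyped_storage().data_ptr()
    # A zero pointer (e.g. empty storage) should not be treated as aliasing.
    return pa != 0 and pa == pb
```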
Ensure a local tied LM head actually aliases the input embedding.
Hugging Face tie_weights() is the first choice because model classes can
have custom tying rules. The direct assignment fallback handles wrapped
models whose generic tie_weights() no longer reaches the local
lm_head/embedding pair after sharding.
Parameters:
Model or pipeline stage to inspect and update.
Returns: bool
True if the local lm_head and input embedding are tied after the call, else False.
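The tie-then-fallback order described above can be sketched as follows. The sketch assumes the local partition exposes hypothetical `lm_head` and `embed_tokens` attributes. Real Hugging Face models locate these modules via `get_output_embeddings()` and `get_input_embeddings()`. The helper names are also hypothetical.

```python
import torch
from torch import nn


def _same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Tensors alias when their underlying storages start at the same address.
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()


def ensure_local_tie(model: nn.Module) -> bool:
    """Hypothetical helper: make lm_head.weight alias the input embedding."""
    head = getattr(model, "lm_head", None)
    emb = getattr(model, "embed_tokens", None)
    if head is None or emb is None:
        return False  # this partition does not own both ends of the tie
    # 1) Prefer the model's own tie_weights(), which may encode custom rules.
    tie = getattr(model, "tie_weights", None)
    if callable(tie):
        tie()
    # 2) Fallback: alias the Parameter directly if tie_weights() did not reach it.
    if not _same_storage(head.weight, emb.weight):
        head.weight = emb.weight
    return _same_storage(head.weight, emb.weight)
```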
Estimate logical bytes in a state dict without materializing tensors.
Estimate logical bytes in a tensor without materializing it.
Format bytes as a human-readable GiB value.
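Because `numel()` and `element_size()` depend only on shape and dtype, byte estimates also work on meta tensors that have no backing memory. The sketch below shows the estimate and the GiB formatting. The helper names are hypothetical, and the two-decimal format is an assumption.

```python
import torch


def estimate_tensor_bytes(t: torch.Tensor) -> int:
    # Shape and dtype metadata suffice; no data is read or allocated.
    return t.numel() * t.element_size()


def estimate_state_dict_bytes(sd: dict[str, torch.Tensor]) -> int:
    # Non-tensor entries (e.g. extra state) are ignored in this sketch.
    return sum(estimate_tensor_bytes(v) for v in sd.values() if isinstance(v, torch.Tensor))


def format_gib(num_bytes: int) -> str:
    # 1 GiB = 1024**3 bytes.
    return f"{num_bytes / 1024**3:.2f} GiB"
```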
Format the output shard count for user-facing log messages.
Return the input embedding weight and normalized name if present.
Parameters:
Model to inspect.
Returns: torch.Tensor | None
Tuple of the embedding weight tensor and its normalized FQN, or (None, None) if no input embedding is found.
Return the first lm_head.weight parameter found on a model.
Parameters:
Model to inspect.
Returns: torch.Tensor | None
Tuple of the parameter tensor and its normalized FQN, or (None, None) if no such parameter is found.
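Below is a sketch of the lm_head lookup under stated assumptions. It illustrates one subtlety: `named_parameters()` de-duplicates shared parameters by default. In a tied model, the `lm_head.weight` alias can therefore be skipped if the embedding is registered first. Passing `remove_duplicate=False` (available in PyTorch 2.x) keeps it. The `_orig_mod` normalization and the function name are assumptions.

```python
from typing import Optional, Tuple

import torch
from torch import nn


def find_lm_head_weight(model: nn.Module) -> Tuple[Optional[torch.Tensor], Optional[str]]:
    """Hypothetical helper: first lm_head.weight parameter and its normalized FQN."""
    # remove_duplicate=False so a tied lm_head.weight is still yielded.
    for name, param in model.named_parameters(remove_duplicate=False):
        # Minimal normalization: drop torch.compile's wrapper segment only.
        norm = ".".join(p for p in name.split(".") if p != "_orig_mod")
        if norm == "lm_head.weight" or norm.endswith(".lm_head.weight"):
            return param, norm
    return None, None
```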
Return the current distributed rank, defaulting to 0 when not initialized.
Return the total checkpoint size recorded in a Hugging Face safetensors index.
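Sharded Hugging Face checkpoints ship a `model.safetensors.index.json` file. Its `metadata.total_size` field records the total tensor byte size, and its `weight_map` maps each tensor name to its shard file. Below is a minimal reader; the function names are hypothetical.

```python
import json
from typing import Optional


def total_size_from_index(index: dict) -> Optional[int]:
    """Hypothetical helper: metadata.total_size from a parsed index, if present."""
    size = (index.get("metadata") or {}).get("total_size")
    return int(size) if size is not None else None


def read_index_total_size(index_path: str) -> Optional[int]:
    # Only the small JSON index is read; shard files are never opened.
    with open(index_path, "r", encoding="utf-8") as f:
        return total_size_from_index(json.load(f))
```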
Return candidate checkpoint keys that can source a tied LM head.
Parameters:
Model or pipeline stage to inspect.
Optional normalized LM head FQN.
Returns: list[str]
Ordered list of possible source FQNs.
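Below is a sketch of candidate generation. It assumes the LM head FQN ends in `lm_head.weight` and that any text before that suffix is a wrapper prefix to reapply. The embedding names are illustrative: `model.embed_tokens.weight` as in Llama-style models and `transformer.wte.weight` as in GPT-2. This version also omits the model argument that the real function takes. The function name is hypothetical.

```python
from typing import Optional

# Illustrative embedding FQNs from common Hugging Face architectures (assumption).
_EMBED_NAMES = ("model.embed_tokens.weight", "transformer.wte.weight", "embed_tokens.weight")


def tied_lm_head_source_candidates(lm_head_fqn: Optional[str] = None) -> list[str]:
    """Hypothetical helper: ordered embedding keys that could source lm_head.weight."""
    prefix = ""
    if lm_head_fqn and (lm_head_fqn == "lm_head.weight" or lm_head_fqn.endswith(".lm_head.weight")):
        # e.g. "language_model.lm_head.weight" -> "language_model."
        prefix = lm_head_fqn[: -len("lm_head.weight")]
    # Prefixed names first, then bare names; dict.fromkeys keeps order and drops repeats.
    candidates = [prefix + n for n in _EMBED_NAMES] + list(_EMBED_NAMES)
    return list(dict.fromkeys(candidates))
```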
Return the current distributed world size, defaulting to 1 when not initialized.
Return whether the current model partition has an actual tied LM head.
This is stricter than is_tied_word_embeddings(): pipeline stages often
keep the config flag set to True even when lm_head and
embed_tokens live on different partitions. Some custom models can also
declare tied embeddings in config without actually aliasing the parameters.
In that case omitting lm_head.weight from a checkpoint loses trained
state, so only treat it as safely tied when the local tensors share storage.
Parameters:
Model or pipeline stage to inspect.
Returns: bool
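True when the model is configured with tied word embeddings, both lm_head.weight and the input embedding are present on this partition, and the two tensors share storage; False otherwise.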
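Below is a sketch of the stricter check. It assumes the config exposes Hugging Face's `tie_word_embeddings` attribute and that the local partition holds hypothetical `lm_head` and `embed_tokens` attributes. The function name is also hypothetical.

```python
from torch import nn


def has_local_tied_lm_head(model: nn.Module) -> bool:
    """Hypothetical helper: config says tied AND the local tensors really alias."""
    cfg = getattr(model, "config", None)
    if not getattr(cfg, "tie_word_embeddings", False):
        return False
    head = getattr(model, "lm_head", None)
    emb = getattr(model, "embed_tokens", None)
    if head is None or emb is None:
        return False  # e.g. a pipeline stage that owns only one side of the tie
    # The config flag alone is not trusted: require shared local storage.
    a, b = head.weight, emb.weight
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()
```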
Return True on the main rank.
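The rank helpers above follow a common `torch.distributed` pattern. They query the process group only when it is available and initialized; otherwise they report a single-process default. Here is a sketch with hypothetical names:

```python
import torch.distributed as dist


def get_rank_safe() -> int:
    # Outside a launched distributed job, behave as the only (rank 0) process.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


def get_world_size_safe() -> int:
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1


def is_main_rank() -> bool:
    # Typically used to gate logging and file writes to a single process.
    return get_rank_safe() == 0
```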
Check if the model’s word embeddings are tied.
Parameters:
The model to check.
Returns: bool
True if the model’s word embeddings are tied, False otherwise.
Populate a missing tied lm_head.weight from its embedding source.
Hugging Face checkpoints for tied-embedding models often omit
lm_head.weight entirely. That is fine for unsplit models where
tie_weights() can restore the alias, but it breaks pipeline-parallel last
stages which own lm_head but not embed_tokens.
Parameters:
Checkpoint state dict to mutate in place.
Target model or pipeline stage.
If True, fall back to the current
lm_head tensor when the tied source cannot be found in
state_dict. This preserves legacy resume behavior for older
checkpoints that were saved without a local lm_head.weight.
Returns: bool
True if a missing lm_head.weight was materialized, else False.
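Once the candidate source keys are known, the materialization step reduces to dictionary logic. The sketch below aliases the first candidate present in the state dict, without copying it. If no candidate is present, it uses an optional fallback tensor, mirroring the legacy-resume behavior described above. The function and parameter names are hypothetical.

```python
from typing import Any, Optional


def materialize_tied_lm_head(
    state_dict: dict[str, Any],
    head_key: str,
    candidates: list[str],
    fallback: Optional[Any] = None,
) -> bool:
    """Hypothetical helper: fill a missing head_key in place; True if filled."""
    if head_key in state_dict:
        return False  # nothing missing, leave the checkpoint untouched
    for key in candidates:
        if key in state_dict:
            # Reuse the embedding tensor object itself (alias, no copy).
            state_dict[head_key] = state_dict[key]
            return True
    if fallback is not None:
        # Legacy resume: reuse the model's current lm_head tensor.
        state_dict[head_key] = fallback
        return True
    return False
```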
Whitelist NVIDIA models to allow remote code execution.
Parameters:
The name or path of the pretrained model.
Returns:
True if the model should be loaded with trust_remote_code, False otherwise.
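Here is a minimal sketch of the allowlist. It assumes, without confirmation from this page, that the rule is a case-insensitive match on the `nvidia/` Hugging Face Hub organization prefix. The source does not specify how local paths are handled, and the function name is hypothetical.

```python
def should_trust_remote_code(pretrained_model_name_or_path: str) -> bool:
    """Hypothetical helper: allow trust_remote_code only for NVIDIA Hub models."""
    # str() so pathlib.Path inputs also work.
    name = str(pretrained_model_name_or_path).strip().lower()
    return name.startswith("nvidia/")
```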