# nemo_automodel.components.utils.model_utils

## Module Contents

### Functions

| Name                                                                                                                                | Description                                                                                       |
| ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| [`_freeze_module_by_attribute_and_patterns`](#nemo_automodel-components-utils-model_utils-_freeze_module_by_attribute_and_patterns) | Helper function to freeze parameters by attribute name and name patterns.                         |
| [`_get_forward_signature`](#nemo_automodel-components-utils-model_utils-_get_forward_signature)                                     | Best-effort retrieval of `model.forward` signature.                                               |
| [`_get_logical_numel`](#nemo_automodel-components-utils-model_utils-_get_logical_numel)                                             | Return the logical number of elements for a parameter, accounting for quantized (packed) storage. |
| [`_get_model_param_stats`](#nemo_automodel-components-utils-model_utils-_get_model_param_stats)                                     | Get the number of trainable parameters and the L2 norm of the model.                              |
| [`_supports_logits_to_keep`](#nemo_automodel-components-utils-model_utils-_supports_logits_to_keep)                                 | Check if the model supports logits\_to\_keep.                                                     |
| [`_supports_seq_lens`](#nemo_automodel-components-utils-model_utils-_supports_seq_lens)                                             | Check if the model's forward() accepts seq\_lens.                                                 |
| [`apply_parameter_freezing`](#nemo_automodel-components-utils-model_utils-apply_parameter_freezing)                                 | Apply parameter freezing based on configuration.                                                  |
| [`cast_mixed_dtype_params_to_bf16`](#nemo_automodel-components-utils-model_utils-cast_mixed_dtype_params_to_bf16)                   | Cast fp32 parameters and buffers to bf16 for FSDP2 compatibility.                                 |
| [`count_model_parameters`](#nemo_automodel-components-utils-model_utils-count_model_parameters)                                     | Count total and trainable parameters. Safe to call on meta-device models.                         |
| [`enable_radio_vit_fused_attn`](#nemo_automodel-components-utils-model_utils-enable_radio_vit_fused_attn)                           | Route RADIO ViT attention through `F.scaled_dot_product_attention`.                               |
| [`filter_forward_kwargs`](#nemo_automodel-components-utils-model_utils-filter_forward_kwargs)                                       | Drop kwargs that `model.forward` does not accept.                                                 |
| [`freeze_deepseek_v4_indexer_params`](#nemo_automodel-components-utils-model_utils-freeze_deepseek_v4_indexer_params)               | Freeze DeepSeek V4 indexer params that only feed discrete top-k masks.                            |
| [`freeze_minimax_m3_indexer_params`](#nemo_automodel-components-utils-model_utils-freeze_minimax_m3_indexer_params)                 | Freeze MiniMax M3 lightning-indexer params that only feed discrete top-k masks.                   |
| [`freeze_unused_kv_sharing_params`](#nemo_automodel-components-utils-model_utils-freeze_unused_kv_sharing_params)                   | Freeze dead K/V parameters in KV-shared layers.                                                   |
| [`get_lm_head_module`](#nemo_automodel-components-utils-model_utils-get_lm_head_module)                                             | Return the model's LM head module, if one can be found.                                           |
| [`get_lm_head_weight`](#nemo_automodel-components-utils-model_utils-get_lm_head_weight)                                             | Return the model's LM-head weight, materializing DTensor weights when needed.                     |
| [`init_empty_weights`](#nemo_automodel-components-utils-model_utils-init_empty_weights)                                             | A context manager under which models are initialized with all parameters on the specified device. |
| [`print_trainable_parameters`](#nemo_automodel-components-utils-model_utils-print_trainable_parameters)                             | Print the number of trainable parameters in the model.                                            |
| [`resolve_trust_remote_code`](#nemo_automodel-components-utils-model_utils-resolve_trust_remote_code)                               | Whitelist NVIDIA models to allow remote code execution.                                           |
| [`skip_random_init`](#nemo_automodel-components-utils-model_utils-skip_random_init)                                                 | Context manager to skip random weight initialization when loading pretrained models.              |
| [`squeeze_input_for_thd`](#nemo_automodel-components-utils-model_utils-squeeze_input_for_thd)                                       | Squeeze batch dimension and prepare inputs for THD (total tokens, heads, head dim) format.        |

### Data

[`VLM_INPUT_KEYS`](#nemo_automodel-components-utils-model_utils-VLM_INPUT_KEYS)

[`logger`](#nemo_automodel-components-utils-model_utils-logger)

### API

```python
nemo_automodel.components.utils.model_utils._freeze_module_by_attribute_and_patterns(
    model,
    attribute_name,
    name_patterns
)
```

Helper function to freeze parameters by attribute name and name patterns.

**Parameters:**

* `model`: The model to apply freezing to.
* `attribute_name`: Name of the model attribute to freeze (e.g., 'vision\_tower').
* `name_patterns`: List of patterns to match in module names.

```python
nemo_automodel.components.utils.model_utils._get_forward_signature(
    model: torch.nn.Module
) -> inspect.Signature | None
```

Best-effort retrieval of `model.forward` signature.

```python
nemo_automodel.components.utils.model_utils._get_logical_numel(
    param
) -> int
```

Return the logical number of elements for a parameter,
accounting for quantized (packed) storage.

For bitsandbytes 4-bit params (Params4bit), the physical tensor
packs multiple values per byte. We recover the logical count from
the original shape stored in param.quant\_state.
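The recovery described above can be sketched roughly as follows. This is a minimal illustration rather than the library's implementation: `logical_numel_sketch` is a hypothetical name, the `quant_state.shape` attribute is an assumption about where the original shape lives, and the parameters are simple stand-in objects instead of real bitsandbytes tensors.

```python
import math
from types import SimpleNamespace


def logical_numel_sketch(param) -> int:
    """Illustrative only: prefer the original shape recorded in quant_state."""
    quant_state = getattr(param, "quant_state", None)
    # Assumption: the quantization state exposes the unpacked shape as `.shape`.
    shape = getattr(quant_state, "shape", None)
    if shape is not None:
        return math.prod(shape)
    # Unquantized parameters report their element count directly.
    return param.numel()


# Stand-in for a packed 4-bit weight: a 64 x 32 matrix stored two values per byte.
packed = SimpleNamespace(
    quant_state=SimpleNamespace(shape=(64, 32)),
    numel=lambda: 64 * 32 // 2,
)
# Stand-in for an ordinary, unquantized parameter.
dense = SimpleNamespace(numel=lambda: 64 * 32)

print(packed.numel(), logical_numel_sketch(packed))
print(dense.numel(), logical_numel_sketch(dense))
```

In this sketch the packed stand-in reports half as many physical elements as logical ones, which is the gap the helper is meant to close.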

```python
nemo_automodel.components.utils.model_utils._get_model_param_stats(
    model: torch.nn.Module
) -> tuple[int, int, float]
```

Get the number of trainable parameters and the L2 norm of the model.

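**Parameters:**

* `model`: Model to analyze.

**Returns:** `tuple[int, int, float]`

Two integer parameter counts, including the number of trainable parameters, followed by the L2 norm as a float.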

```python
nemo_automodel.components.utils.model_utils._supports_logits_to_keep(
    model: torch.nn.Module
) -> bool
```

Check if the model supports logits\_to\_keep.

**Parameters:**

* `model`: The model to check.

**Returns:** `bool`

True if the model supports logits\_to\_keep, False otherwise.

```python
nemo_automodel.components.utils.model_utils._supports_seq_lens(
    model: torch.nn.Module
) -> bool
```

Check if the model's forward() accepts seq\_lens.

Returns True if:

* forward() has an explicit `seq_lens` parameter, OR
* forward() has \*\*kwargs (so it won't crash if seq\_lens is passed)

Returns False otherwise (passing seq\_lens would cause "unexpected kwarg" error).
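The two conditions above map directly onto the standard `inspect` module. The following is a minimal sketch under stated assumptions, not the library's code: `accepts_seq_lens_sketch` and the three example functions are hypothetical, and treating an uninspectable signature as unsupported is a choice made here for illustration.

```python
import inspect


def accepts_seq_lens_sketch(forward) -> bool:
    """Illustrative check: explicit `seq_lens` parameter or a **kwargs catch-all."""
    try:
        params = inspect.signature(forward).parameters.values()
    except (TypeError, ValueError):
        # Assumption for this sketch: an uninspectable signature counts as unsupported.
        return False
    return any(
        p.name == "seq_lens" or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )


def explicit(input_ids, seq_lens=None):
    return input_ids


def catch_all(input_ids, **kwargs):
    return input_ids


def strict(input_ids, attention_mask=None):
    return input_ids


print(accepts_seq_lens_sketch(explicit))
print(accepts_seq_lens_sketch(catch_all))
print(accepts_seq_lens_sketch(strict))
```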

```python
nemo_automodel.components.utils.model_utils.apply_parameter_freezing(
    model,
    freeze_config
)
```

Apply parameter freezing based on configuration.

**Parameters:**

* `model`: The model to apply freezing to.
* `freeze_config`: Configuration dict specifying what to freeze.

```python
nemo_automodel.components.utils.model_utils.cast_mixed_dtype_params_to_bf16(
    model
)
```

Cast fp32 parameters and buffers to bf16 for FSDP2 compatibility.

```python
nemo_automodel.components.utils.model_utils.count_model_parameters(
    model: torch.nn.Module
) -> tuple[int, int]
```

Count total and trainable parameters. Safe to call on meta-device models.

**Parameters:**

* `model`: Model to analyze.

**Returns:** `tuple[int, int]`

The total and trainable parameter counts.
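Counting only needs tensor shapes, which is why it can work on a model whose weights were never allocated. The snippet below is a hedged sketch of that idea using plain PyTorch calls; it does not call `count_model_parameters` and makes no claim about how the function is implemented.

```python
import torch.nn as nn

# A meta-device module has shapes and dtypes but no allocated storage.
model = nn.Linear(4, 3, device="meta")
model.bias.requires_grad_(False)  # freeze the bias to make the two counts differ

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(total, trainable)
```

The weight contributes 4 x 3 elements and the frozen bias contributes 3, so the two sums differ by exactly the bias size.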

```python
nemo_automodel.components.utils.model_utils.enable_radio_vit_fused_attn(
    model
)
```

Route RADIO ViT attention through `F.scaled_dot_product_attention`.

RADIO's timm Attention blocks default to `fused_attn=False`, which
materializes the full `(B, H, seq, seq)` attention tensor (\~5 GiB per
block at RADIO-v2-H + dynamic-resolution patch counts). Flipping
`fused_attn=True` matches the Megatron-Bridge path which sets
`vision_config.use_flash_attn=True` via
`attn_implementation="flash_attention_2"`.

No-op when the model has no RADIO vision tower.

**Parameters:**

* `model`: The model to patch in place.

```python
nemo_automodel.components.utils.model_utils.filter_forward_kwargs(
    model: torch.nn.Module,
    kwargs: dict
) -> dict
```

Drop kwargs that `model.forward` does not accept.

If the model exposes `**kwargs` or its signature cannot be inspected, the
input kwargs are returned unchanged. The original dict is never mutated.
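The behavior described above can be approximated with `inspect.signature`. This is an illustrative sketch, not the library's implementation: `filter_kwargs_sketch`, `strict`, and `catch_all` are hypothetical names, and the sketch takes the forward callable directly rather than a model.

```python
import inspect


def filter_kwargs_sketch(forward, kwargs: dict) -> dict:
    """Illustrative filter: keep only keyword arguments that `forward` names."""
    try:
        params = inspect.signature(forward).parameters
    except (TypeError, ValueError):
        return kwargs  # signature cannot be inspected: pass everything through
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs  # **kwargs accepts anything: pass everything through
    # Build a new dict so the caller's dict is left untouched.
    return {k: v for k, v in kwargs.items() if k in params}


def strict(input_ids, attention_mask=None):
    return input_ids


def catch_all(input_ids, **kwargs):
    return input_ids


batch = {"input_ids": [1, 2, 3], "seq_lens": [3]}
filtered = filter_kwargs_sketch(strict, batch)
passthrough = filter_kwargs_sketch(catch_all, batch)

print(sorted(filtered))
print(sorted(passthrough))
print(sorted(batch))
```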

```python
nemo_automodel.components.utils.model_utils.freeze_deepseek_v4_indexer_params(
    model
)
```

Freeze DeepSeek V4 indexer params that only feed discrete top-k masks.

```python
nemo_automodel.components.utils.model_utils.freeze_minimax_m3_indexer_params(
    model
)
```

Freeze MiniMax M3 lightning-indexer params that only feed discrete top-k masks.

```python
nemo_automodel.components.utils.model_utils.freeze_unused_kv_sharing_params(
    model
)
```

Freeze dead K/V parameters in KV-shared layers.

Models like Gemma4 E2B/E4B use KV-sharing where the last N layers reuse
key/value states from earlier layers. The `k_proj`, `v_proj`,
`k_norm`, and `v_norm` modules still exist in those shared layers but
are never used during forward. Their parameters therefore receive no
gradients, yet the optimizer still tracks them. On checkpoint resume the
distributed checkpoint framework expects optimizer state for every
parameter the optimizer was created with, but zero-gradient params may
have been excluded from the saved state — causing a `RuntimeError`.

Calling this function **before** optimizer creation sets
`requires_grad=False` on the dead parameters so the optimizer never
tracks them, keeping save and load consistent.

**Parameters:**

* `model`: The model (or pipeline-parallel model part).
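The ordering requirement (freeze first, create the optimizer second) can be illustrated with a toy module. This is a sketch under assumptions, not a call into the library: the `ModuleDict` is a hypothetical stand-in for a KV-shared layer, the name-prefix match replaces whatever detection the real function performs, and the optimizer is assumed to be built only from parameters with `requires_grad=True`.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a KV-shared attention layer: k_proj and v_proj
# exist as modules but are assumed to be unused during forward.
layer = nn.ModuleDict(
    {
        "q_proj": nn.Linear(8, 8),
        "k_proj": nn.Linear(8, 8),
        "v_proj": nn.Linear(8, 8),
    }
)

# Step 1: freeze the dead projections before any optimizer exists.
for name, param in layer.named_parameters():
    if name.startswith(("k_proj", "v_proj")):
        param.requires_grad_(False)

# Step 2: build the optimizer from trainable parameters only (assumed convention).
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

tracked = sum(len(group["params"]) for group in optimizer.param_groups)
print(tracked)  # only the q_proj weight and bias remain
```

Reversing the two steps would leave the frozen parameters registered with the optimizer, which is the save/load mismatch the function is meant to avoid.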

```python
nemo_automodel.components.utils.model_utils.get_lm_head_module(
    model: torch.nn.Module
) -> torch.nn.Module | None
```

Return the model's LM head module, if one can be found.

```python
nemo_automodel.components.utils.model_utils.get_lm_head_weight(
    model: torch.nn.Module
) -> torch.Tensor
```

Return the model's LM-head weight, materializing DTensor weights when needed.

```python
nemo_automodel.components.utils.model_utils.init_empty_weights()
```

A context manager under which models are initialized with all parameters on the specified device.

Example:

```python
import torch.nn as nn
from nemo_automodel.components.utils.model_utils import init_empty_weights

with init_empty_weights():
    tst = nn.Linear(100, 100)  # on `cuda` device
```

**Parameters:**

Device to initialize all parameters on.

```python
nemo_automodel.components.utils.model_utils.print_trainable_parameters(
    model: torch.nn.Module,
    name: str = 'Model'
) -> tuple[int, int]
```

Print the number of trainable parameters in the model.

**Parameters:**

* `model`: Model to analyze.
* `name`: Label for the summary header (e.g. `"Draft"` to distinguish the
  draft model from the target in speculative-decoding training).

**Returns:** `tuple[int, int]`

A pair of integer parameter counts.

```python
nemo_automodel.components.utils.model_utils.resolve_trust_remote_code(
    pretrained_model_name_or_path
)
```

Whitelist NVIDIA models to allow remote code execution.

**Parameters:**

* `pretrained_model_name_or_path`: The name or path of the pretrained model.

**Returns:**

True if the model should be loaded with trust\_remote\_code, False otherwise.

```python
nemo_automodel.components.utils.model_utils.skip_random_init()
```

Context manager to skip random weight initialization when loading pretrained models.

```python
nemo_automodel.components.utils.model_utils.squeeze_input_for_thd(
    input_ids,
    position_ids,
    padding_mask,
    attn_kwargs,
    seqlens_padding_value = -1000
)
```

Squeeze batch dimension and prepare inputs for THD (total tokens, heads, head dim) format.

This function removes the batch dimension from input tensors and processes attention
kwargs for use with Transformer Engine's THD format. It's typically used when the
batch has already been converted to THD format (with batch\_size=1 as a placeholder
dimension) and that dimension needs to be removed.

The function performs three key operations:

1. Removes the batch dimension (dim 0) from input tensors
2. Filters out padding values from cumulative sequence length tensors
3. Converts max\_seqlen from tensor to scalar if needed

**Parameters:**

* `input_ids`: Input token IDs with shape \[1, total\_tokens]
  or \[1, total\_tokens, hidden\_dim]. The first dimension will be squeezed.
  `None` is permitted when the caller feeds the model via `inputs_embeds`
  instead. In that case the embeddings are squeezed inside the model
  forward (the `squeezed_for_thd` branch in `NemotronHModel.forward`
  and analogous code paths), so this helper has nothing to squeeze and
  simply returns `None` for the `input_ids` slot.
* `position_ids`: Position IDs with shape \[1, total\_tokens].
  The first dimension will be squeezed.
* `padding_mask`: Padding mask with shape \[1, total\_tokens].
  The first dimension will be squeezed.
* `attn_kwargs`: Dictionary of attention-related tensors. May contain:
  * cu\_seqlens: Cumulative sequence lengths \[1, num\_seqs+1]
  * cu\_seqlens\_padded: Cumulative padded sequence lengths \[1, num\_seqs+1]
  * max\_seqlen: Maximum sequence length (tensor or int)
  * Other attention parameters (will be squeezed if tensors)
* `seqlens_padding_value`: Sentinel value used to indicate padding in
  cu\_seqlens and cu\_seqlens\_padded tensors. These values will be filtered
  out. Default: -1000.

**Returns:**

A tuple containing:

* input\_ids (torch.Tensor): Input IDs with batch dimension removed \[total\_tokens]
  or \[total\_tokens, hidden\_dim]
* position\_ids (torch.Tensor): Position IDs with batch dimension removed \[total\_tokens]
* padding\_mask (torch.Tensor): Padding mask with batch dimension removed \[total\_tokens]
* attn\_kwargs (dict): Updated attention kwargs with:
  * Batch dimensions removed from all tensor values
  * Padding values filtered from cu\_seqlens and cu\_seqlens\_padded
  * max\_seqlen converted to scalar if it was a tensor
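The three operations listed above can be walked through by hand on a tiny batch. This is a hedged sketch that uses plain PyTorch calls rather than `squeeze_input_for_thd` itself; the tensor values and the `PAD` name are made up for illustration.

```python
import torch

PAD = -1000  # same value as the documented default sentinel

# Two packed sequences of lengths 3 and 2, with a placeholder batch dim of 1.
input_ids = torch.tensor([[11, 12, 13, 21, 22]])    # [1, total_tokens]
position_ids = torch.tensor([[0, 1, 2, 0, 1]])      # [1, total_tokens]
attn_kwargs = {
    "cu_seqlens": torch.tensor([[0, 3, 5, PAD]]),   # trailing sentinel padding
    "max_seqlen": torch.tensor([3]),
}

# 1. Remove the placeholder batch dimension (dim 0).
input_ids = input_ids.squeeze(0)
position_ids = position_ids.squeeze(0)
cu_seqlens = attn_kwargs["cu_seqlens"].squeeze(0)

# 2. Filter sentinel padding out of the cumulative sequence lengths.
cu_seqlens = cu_seqlens[cu_seqlens != PAD]

# 3. Convert max_seqlen from a one-element tensor to a Python scalar.
max_seqlen = attn_kwargs["max_seqlen"]
if isinstance(max_seqlen, torch.Tensor):
    max_seqlen = int(max_seqlen.item())

print(tuple(input_ids.shape), cu_seqlens.tolist(), max_seqlen)
```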

```python
nemo_automodel.components.utils.model_utils.VLM_INPUT_KEYS: tuple[str, ...] = ('input_ids', 'pixel_values', 'image_flags', 'imgs_sizes', 'image_position_ids',...
```

```python
nemo_automodel.components.utils.model_utils.logger = logging.getLogger(__name__)
```