# nemo_automodel.components.models.step3p7.vision_encoder

## Module Contents

### Classes

| Name                                                                                                              | Description                                                         |
| ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| [`EncoderLayerScale`](#nemo_automodel-components-models-step3p7-vision_encoder-EncoderLayerScale)                 | Per-channel residual scaling used when ls\_init\_value is set.      |
| [`EncoderMLP`](#nemo_automodel-components-models-step3p7-vision_encoder-EncoderMLP)                               | Feed-forward network used inside each transformer block.            |
| [`EncoderRope2D`](#nemo_automodel-components-models-step3p7-vision_encoder-EncoderRope2D)                         | Cacheable 2D rotary positional embedding.                           |
| [`EncoderVisionAttention`](#nemo_automodel-components-models-step3p7-vision_encoder-EncoderVisionAttention)       | Multi-head self attention with optional 2D RoPE.                    |
| [`EncoderVisionBlock`](#nemo_automodel-components-models-step3p7-vision_encoder-EncoderVisionBlock)               | A single Vision Transformer block (self-attention + MLP).           |
| [`EncoderVisionTransformer`](#nemo_automodel-components-models-step3p7-vision_encoder-EncoderVisionTransformer)   | Stack of encoder blocks parameterised by Step35VisionEncoderConfig. |
| [`StepRoboticsVisionEncoder`](#nemo_automodel-components-models-step3p7-vision_encoder-StepRoboticsVisionEncoder) | Vision encoder built from StepRoboticsVisionEncoderConfig.          |

### Functions

| Name                                                                                            | Description                                   |
| ----------------------------------------------------------------------------------------------- | --------------------------------------------- |
| [`apply_rotary_emb`](#nemo_automodel-components-models-step3p7-vision_encoder-apply_rotary_emb) | Apply 2D rotary embeddings to queries / keys. |
| [`rotate_half`](#nemo_automodel-components-models-step3p7-vision_encoder-rotate_half)           | Rotate last dimension halves (used by RoPE).  |

### API

```python
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderLayerScale(
    dim: int,
    init_values: float
)
```

**Bases:** `Module`

Per-channel residual scaling used when ls\_init\_value is set.

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderLayerScale.forward(
    hidden_states: torch.Tensor
) -> torch.Tensor
```
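
The signature above suggests a learnable per-channel multiplier. The following is a minimal sketch of that idea, not the module's actual code: the class name `LayerScaleSketch` and the parameter name `gamma` are hypothetical, and initializing every channel to `init_values` is an assumption.

```python
import torch
from torch import nn


class LayerScaleSketch(nn.Module):
    """Hypothetical stand-in: one learnable scale per channel."""

    def __init__(self, dim: int, init_values: float) -> None:
        super().__init__()
        # Assumption: a (dim,) parameter filled with init_values.
        self.gamma = nn.Parameter(torch.full((dim,), float(init_values)))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # gamma broadcasts over the leading (batch, sequence) dimensions.
        return hidden_states * self.gamma


scale = LayerScaleSketch(dim=8, init_values=1e-5)
x = torch.ones(2, 4, 8)  # (batch, tokens, channels)
y = scale(x)
```

A small `init_values` keeps each residual branch close to zero at the start of training, which is the usual motivation for this kind of scaling.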

```python
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderMLP(
    hidden_size: int,
    intermediate_size: int,
    hidden_act: str
)
```

**Bases:** `Module`

Feed-forward network used inside each transformer block.

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderMLP.forward(
    hidden_states: torch.Tensor
) -> torch.Tensor
```
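
As a rough illustration of how the three constructor arguments could fit together, here is a two-layer feed-forward sketch. The class name `MLPSketch`, the layer names `fc1`/`fc2`, and the `_ACTIVATIONS` lookup table are all assumptions; the real module may resolve `hidden_act` differently and may use another layer layout.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Assumed activation lookup; only a few common names are covered here.
_ACTIVATIONS = {"gelu": F.gelu, "relu": F.relu, "silu": F.silu}


class MLPSketch(nn.Module):
    """Hypothetical feed-forward network: expand, activate, project back."""

    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str) -> None:
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.act = _ACTIVATIONS[hidden_act]

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The output keeps the input shape: (..., hidden_size).
        return self.fc2(self.act(self.fc1(hidden_states)))


mlp = MLPSketch(hidden_size=16, intermediate_size=64, hidden_act="gelu")
mlp_out = mlp(torch.randn(2, 10, 16))
```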

```python
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D(
    dim: int,
    max_grid_height: int,
    max_grid_width: int,
    use_cls_token: bool = False,
    theta: typing.Union[int, float] = 10000,
    max_freq: int = 10,
    num_freqs: int = 1,
    theta_rescale_factor: float = 1.0
)
```

**Bases:** `Module`

Cacheable 2D rotary positional embedding.

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D._compute_2d_freqs() -> torch.Tensor
```

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D._compute_freqs(
    t: torch.Tensor,
    inv_freq: torch.Tensor
)
```

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D._compute_inv_freq(
    base: typing.Union[int, float],
    dim: int
) -> torch.Tensor
```

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D.forward(
    q: torch.Tensor,
    k: torch.Tensor,
    grid_hw: tuple[int, int]
)
```
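
The helper names above hint at the usual RoPE recipe: build inverse frequencies from a base, take an outer product with positions, and combine a row axis with a column axis. The sketch below shows one common way to do that. It is an assumption about the general technique, not a copy of this class: the function names are hypothetical, the channels are split evenly between rows and columns, and `max_freq`, `num_freqs`, `theta_rescale_factor`, and the CLS token are ignored.

```python
import torch


def inv_freq_sketch(base: float, dim: int) -> torch.Tensor:
    # Standard RoPE inverse frequencies: base^(-2i/dim) for i in [0, dim/2).
    exponents = torch.arange(0, dim, 2, dtype=torch.float32) / dim
    return 1.0 / (base ** exponents)


def grid_angles_sketch(grid_h: int, grid_w: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Assumption: half of the rotary channels encode the row, half the column.
    inv = inv_freq_sketch(base, dim // 2)                     # (dim/4,)
    rows = torch.outer(torch.arange(grid_h).float(), inv)    # (H, dim/4)
    cols = torch.outer(torch.arange(grid_w).float(), inv)    # (W, dim/4)
    rows = rows[:, None, :].expand(grid_h, grid_w, -1)
    cols = cols[None, :, :].expand(grid_h, grid_w, -1)
    # One row of angles per token, tokens in row-major order.
    return torch.cat((rows, cols), dim=-1).reshape(grid_h * grid_w, -1)


angles = grid_angles_sketch(grid_h=3, grid_w=5, dim=16)  # (15, 8)
```

Because the table depends only on the grid size, it can be computed once for the maximum grid and sliced per call, which is presumably what "cacheable" refers to.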

```python
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionAttention(
    hidden_size: int,
    num_heads: int,
    max_grid_height: int,
    max_grid_width: int,
    use_cls_token: bool = False,
    use_rope2d: bool = True,
    rope_theta: typing.Union[int, float] = 10000,
    rope_max_freq: int = 10,
    rope_num_freqs: int = 1,
    rope_theta_rescale_factor: float = 1.0,
    rope_freqs_for: typing.Literal['lang', 'pixel', 'constant'] = 'lang'
)
```

**Bases:** `Module`

Multi-head self attention with optional 2D RoPE.

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionAttention.forward(
    hidden_states: torch.Tensor,
    grid_hw: tuple[int, int]
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionBlock(
    hidden_size: int,
    num_heads: int,
    mlp_ratio: float,
    hidden_act: str,
    layer_norm_eps: float,
    ls_init_value: typing.Optional[float] = None,
    max_grid_height: typing.Optional[int] = None,
    max_grid_width: typing.Optional[int] = None,
    use_cls_token: bool = False,
    use_rope2d: bool = True,
    rope_kwargs: typing.Optional[dict] = None
)
```

**Bases:** `Module`

A single Vision Transformer block (self-attention + MLP).

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionBlock.forward(
    hidden_states: torch.Tensor,
    grid_hw: tuple[int, int]
) -> torch.Tensor
```
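
To show how `mlp_ratio`, `layer_norm_eps`, and `ls_init_value` might interact, here is a pre-norm residual block sketch. Everything in it is an assumption: `BlockSketch` and `_Scale` are hypothetical names, `nn.MultiheadAttention` stands in for the module's own attention (so no 2D RoPE and no `grid_hw` argument), GELU is hard-coded, and the pre-norm ordering is a guess.

```python
import torch
from torch import nn


class _Scale(nn.Module):
    """Hypothetical per-channel scale applied to a residual branch."""

    def __init__(self, dim: int, init: float) -> None:
        super().__init__()
        self.gamma = nn.Parameter(torch.full((dim,), float(init)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma


class BlockSketch(nn.Module):
    def __init__(self, hidden_size, num_heads, mlp_ratio, layer_norm_eps, ls_init_value=None):
        super().__init__()
        inner = int(hidden_size * mlp_ratio)  # MLP width derived from the ratio
        self.norm1 = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, inner), nn.GELU(), nn.Linear(inner, hidden_size)
        )
        # Scaling is only created when ls_init_value is given.
        use_ls = ls_init_value is not None
        self.ls1 = _Scale(hidden_size, ls_init_value) if use_ls else nn.Identity()
        self.ls2 = _Scale(hidden_size, ls_init_value) if use_ls else nn.Identity()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        h = self.norm1(hidden_states)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        hidden_states = hidden_states + self.ls1(attn_out)
        return hidden_states + self.ls2(self.mlp(self.norm2(hidden_states)))


block = BlockSketch(hidden_size=32, num_heads=4, mlp_ratio=4.0, layer_norm_eps=1e-6)
block_out = block(torch.randn(2, 24, 32))  # (batch, tokens, channels)
```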

```python
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionTransformer(
    embed_dim: int,
    depth: int,
    num_heads: int,
    mlp_ratio: float,
    hidden_act: str,
    layer_norm_eps: float,
    ls_init_value: typing.Optional[float] = None,
    max_grid_height: typing.Optional[int] = None,
    max_grid_width: typing.Optional[int] = None,
    use_cls_token: bool = False,
    use_rope2d: bool = True,
    rope_kwargs: typing.Optional[dict] = None
)
```

**Bases:** `Module`

Stack of encoder blocks parameterised by Step35VisionEncoderConfig.

```python
nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionTransformer.forward(
    hidden_states: torch.Tensor,
    grid_hw: tuple[int, int]
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder(
    config: nemo_automodel.components.models.step3p7.configuration_step3p7.StepRoboticsVisionEncoderConfig
)
```

**Bases:** `Module`

Vision encoder built from StepRoboticsVisionEncoderConfig.

The encoder performs patch embedding followed by a stack of transformer
blocks. Only the config fields defined in StepRoboticsVisionEncoderConfig (and
StepRoboticVLConfig.vision\_config) are expected.
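
The patch embedding step described above can be sketched with a strided convolution that turns an image into a sequence of tokens plus a grid size. This is an illustration of the general technique only: `patch_size`, `hidden_size`, and the use of `nn.Conv2d` are assumptions, and the values are made up rather than taken from `StepRoboticsVisionEncoderConfig`.

```python
import torch
from torch import nn

# Hypothetical values chosen for illustration.
patch_size, hidden_size = 14, 32
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

pixel_values = torch.randn(2, 3, 56, 84)         # (B, C, H, W)
feature_map = patch_embed(pixel_values)          # (B, D, H / 14, W / 14)
grid_hw = tuple(feature_map.shape[-2:])          # patches per column, per row
tokens = feature_map.flatten(2).transpose(1, 2)  # (B, N, D) with N = grid_h * grid_w
```

The resulting `grid_hw` is the kind of value that the `grid_hw: tuple[int, int]` argument of the transformer blocks would carry, so that positional information can follow the 2D layout of the patches.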

```python
nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder.forward(
    pixel_values: torch.Tensor
) -> torch.Tensor
```

**Parameters:**

- `pixel_values`: Image tensor of shape (B, C, H, W).
- Negative indices stop after a given block (e.g., -1 uses all blocks).
- If True and a CLS token is used, remove it from the output.

```python
nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder.sample_abs_posemb(
    grid_h: int,
    grid_w: int
)
```

```python
nemo_automodel.components.models.step3p7.vision_encoder.apply_rotary_emb(
    freqs: torch.Tensor,
    t: torch.Tensor,
    start_index: int = 0,
    scale: float = 1.0,
    seq_dim: int = -2
) -> torch.Tensor
```

Apply 2D rotary embeddings to queries / keys.
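
The sketch below illustrates the common rotary formula `t * cos(freqs) + rotate_half(t) * sin(freqs)` applied to a slice of the last dimension. It is an assumption about how a function with this signature typically behaves, not the module's code: the names are hypothetical, `seq_dim` is omitted, and the split-halves form of `rotate_half` is assumed (which requires each angle to appear in both halves of `freqs`).

```python
import torch


def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    # Assumed split-halves convention: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_sketch(freqs, t, start_index=0, scale=1.0):
    # Only channels [start_index, start_index + rot_dim) are rotated.
    rot_dim = freqs.shape[-1]
    end_index = start_index + rot_dim
    left, mid, right = t[..., :start_index], t[..., start_index:end_index], t[..., end_index:]
    mid = (mid * freqs.cos() + rotate_half_sketch(mid) * freqs.sin()) * scale
    return torch.cat((left, mid, right), dim=-1)


angles = torch.rand(6, 4)                    # (tokens, rot_dim / 2)
freqs = torch.cat((angles, angles), dim=-1)  # same angle for both halves: (6, 8)
q = torch.randn(2, 3, 6, 16)                 # (batch, heads, tokens, head_dim)
q_rot = apply_rotary_sketch(freqs, q)
```

With `scale=1.0` the rotated channels keep their norm and the channels past `rot_dim` pass through unchanged, which is a quick sanity check for any implementation of this formula.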

```python
nemo_automodel.components.models.step3p7.vision_encoder.rotate_half(
    x: torch.Tensor
) -> torch.Tensor
```

Rotate last dimension halves (used by RoPE).
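
Two conventions for this operation are common in RoPE code, and this page does not say which one the module uses. The sketch below shows both for comparison; the function names are hypothetical and neither is claimed to match the actual implementation.

```python
import torch


def rotate_half_split(x: torch.Tensor) -> torch.Tensor:
    # Split-halves convention: (first half, second half) -> (-second, first).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def rotate_half_interleaved(x: torch.Tensor) -> torch.Tensor:
    # Interleaved convention: each adjacent pair (a, b) -> (-b, a).
    pairs = x.reshape(*x.shape[:-1], -1, 2)
    x1, x2 = pairs.unbind(dim=-1)
    return torch.stack((-x2, x1), dim=-1).flatten(-2)


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
split = rotate_half_split(x)              # tensor([-3., -4.,  1.,  2.])
interleaved = rotate_half_interleaved(x)  # tensor([-2.,  1., -4.,  3.])
```

Whichever convention is used, the frequency tensor must be laid out to match it, so the two forms are not interchangeable without also reordering the angles.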