nemo_automodel.components.models.llava_onevision.rice_vit

Rice Vision Transformer for LLaVA-OneVision-1.5.

Ported from lmms-lab/LLaVA-OneVision-1.5’s modeling_llavaonevision1_5.py.

Module Contents

Classes

| Name | Description |
| --- | --- |
| RiceAttention | Eager block-diagonal attention over variable-length image segments. |
| RiceBlock | - |
| RiceFlashAttention2 | Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn). |
| RiceMlp | - |
| RicePatchEmbed | - |
| RicePatchMerger | - |
| RiceRotaryEmbedding | - |
| RiceSdpaAttention | SDPA variant with an additive block-diagonal mask. |
| RiceTransformer | Rice ViT with per-image class-token insertion and block-diagonal attention. |

Functions

apply_rotary_pos_emb_vision
rotate_half

Data

_ATTENTION_CLASSES

API

class nemo_automodel.components.models.llava_onevision.rice_vit.RiceAttention(
dim: int,
num_heads: int = 16
)

Bases: Module

Eager block-diagonal attention over variable-length image segments.

head_dim = dim // num_heads
proj = nn.Linear(dim, dim)
qkv = nn.Linear(dim, dim * 3, bias=True)
nemo_automodel.components.models.llava_onevision.rice_vit.RiceAttention.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
position_embeddings: typing.Optional[typing.Tuple[torch.Tensor, torch.Tensor]] = None
) -> torch.Tensor
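
To make "block-diagonal attention over variable-length image segments" concrete, here is a minimal sketch, not the module's actual code. It assumes `q`, `k`, `v` are already projected and shaped `(total_tokens, num_heads, head_dim)`, and that `cu_seqlens` holds cumulative segment boundaries starting at 0. The helper name `block_diagonal_attention` is hypothetical; the real `forward` may build a mask rather than loop over segments.

```python
import torch


def block_diagonal_attention(q, k, v, cu_seqlens):
    """Attend within each segment only.

    q, k, v: (total_tokens, num_heads, head_dim)
    cu_seqlens: (num_segments + 1,) cumulative boundaries, starting at 0.
    """
    bounds = cu_seqlens.tolist()
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Slice one segment and move heads first: (num_heads, seg_len, head_dim).
        qs = q[start:end].transpose(0, 1)
        ks = k[start:end].transpose(0, 1)
        vs = v[start:end].transpose(0, 1)
        weights = torch.softmax(qs @ ks.transpose(-1, -2) * scale, dim=-1)
        # Back to (seg_len, num_heads, head_dim).
        outputs.append((weights @ vs).transpose(0, 1))
    return torch.cat(outputs, dim=0)


torch.manual_seed(0)
cu_seqlens = torch.tensor([0, 3, 8])  # two segments: 3 tokens and 5 tokens
q, k, v = (torch.randn(8, 2, 4) for _ in range(3))
out = block_diagonal_attention(q, k, v, cu_seqlens)
```

Because each segment is processed in isolation, changing keys or values in one image cannot affect the output tokens of another.
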
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceBlock(
config,
attn_implementation: str = 'eager'
)

Bases: Module

attn
mlp
norm1
norm2
nemo_automodel.components.models.llava_onevision.rice_vit.RiceBlock.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor]
) -> torch.Tensor
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceFlashAttention2(
dim: int,
num_heads: int = 16
)

Bases: Module

Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn).

head_dim = dim // num_heads
proj = nn.Linear(dim, dim)
qkv = nn.Linear(dim, dim * 3, bias=True)
nemo_automodel.components.models.llava_onevision.rice_vit.RiceFlashAttention2.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor]
) -> torch.Tensor
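
Variable-length flash attention kernels take the cumulative boundaries plus the longest segment length. The sketch below shows one way those arguments could be derived from `cu_seqlens`; the helper `varlen_args` is hypothetical and the module may compute them differently. The kernel call itself is left as a comment because it needs `flash_attn`, a CUDA device, and half-precision inputs.

```python
import torch


def varlen_args(cu_seqlens: torch.Tensor):
    """Return (int32 cumulative boundaries, longest segment length)."""
    cu = cu_seqlens.to(torch.int32)
    max_seqlen = int((cu[1:] - cu[:-1]).max())
    return cu, max_seqlen


cu, max_seqlen = varlen_args(torch.tensor([0, 3, 8, 10]))

# With flash_attn installed and q, k, v shaped (total_tokens, num_heads, head_dim)
# in fp16/bf16 on a CUDA device, the call would look roughly like:
#   from flash_attn import flash_attn_varlen_func
#   out = flash_attn_varlen_func(q, k, v, cu, cu, max_seqlen, max_seqlen)
```
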
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceMlp(
dim: int,
hidden_dim: int,
hidden_act: str
)

Bases: Module

act = ACT2FN[hidden_act]
fc1 = nn.Linear(dim, hidden_dim)
fc2 = nn.Linear(hidden_dim, dim)
nemo_automodel.components.models.llava_onevision.rice_vit.RiceMlp.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchEmbed(
patch_size: int = 14,
temporal_patch_size: int = 1,
in_channels: int = 3,
embed_dim: int = 1024
)

Bases: Module

proj
temporal_patch_size = 1
nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchEmbed.forward(
hidden_states: torch.Tensor
) -> torch.Tensor
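
The reference above does not show how `proj` is built. As an illustration only, the sketch below assumes a common ViT layout: a bias-free `nn.Conv2d` whose kernel and stride both equal `patch_size`, applied to pre-flattened patches of shape `(num_patches, in_channels * patch_size * patch_size)`. The class name `PatchEmbedSketch` and this input layout are assumptions, not confirmed by the page.

```python
import torch
from torch import nn


class PatchEmbedSketch(nn.Module):
    """Hypothetical stand-in: one conv step per patch yields one token."""

    def __init__(self, patch_size=14, in_channels=3, embed_dim=1024):
        super().__init__()
        self.patch_size = patch_size
        self.in_channels = in_channels
        self.embed_dim = embed_dim
        # Assumed layout: kernel == stride == patch_size, no bias.
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size, bias=False
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Un-flatten each row into a (channels, patch, patch) image tile.
        patches = hidden_states.reshape(
            -1, self.in_channels, self.patch_size, self.patch_size
        )
        # Conv output is (num_patches, embed_dim, 1, 1); flatten to tokens.
        return self.proj(patches).reshape(-1, self.embed_dim)


embed = PatchEmbedSketch(patch_size=14, in_channels=3, embed_dim=32)
flat_patches = torch.randn(6, 3 * 14 * 14)
tokens = embed(flat_patches)  # one 32-dim token per input patch
```
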
class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchMerger(
dim: int,
context_dim: int,
spatial_merge_size: int = 2,
layer_norm_eps: float = 1e-05
)

Bases: Module

hidden_size = context_dim * spatial_merge_size ** 2
ln_q = LayerNorm(context_dim, eps=layer_norm_eps)
mlp
nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchMerger.forward(
x: torch.Tensor
) -> torch.Tensor
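
The `hidden_size = context_dim * spatial_merge_size ** 2` attribute suggests that groups of `spatial_merge_size ** 2` neighboring tokens are concatenated before projection. The sketch below illustrates that arithmetic. The MLP layout (Linear, GELU, Linear) and the class name `PatchMergerSketch` are assumptions, since the page does not show how `mlp` is defined.

```python
import torch
from torch import nn


class PatchMergerSketch(nn.Module):
    """Hypothetical stand-in showing the merge arithmetic only."""

    def __init__(self, dim, context_dim, spatial_merge_size=2, layer_norm_eps=1e-5):
        super().__init__()
        self.hidden_size = context_dim * spatial_merge_size**2
        self.ln_q = nn.LayerNorm(context_dim, eps=layer_norm_eps)
        # Assumed MLP layout; the real module may differ.
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.hidden_size),
            nn.GELU(),
            nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize per token, then fold each group of merge_size**2
        # consecutive tokens into one row of width hidden_size.
        return self.mlp(self.ln_q(x).reshape(-1, self.hidden_size))


merger = PatchMergerSketch(dim=64, context_dim=16, spatial_merge_size=2)
merged = merger(torch.randn(8, 16))  # 8 tokens -> 8 / 4 = 2 merged tokens
```

With `spatial_merge_size=2`, the token count drops by a factor of four while the feature width becomes `dim`.
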
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceRotaryEmbedding(
dim: int,
theta: float = 10000.0
)

Bases: Module

nemo_automodel.components.models.llava_onevision.rice_vit.RiceRotaryEmbedding.forward(
seqlen: int
) -> torch.Tensor
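
For orientation, a rotary embedding with signature `(dim, theta)` and `forward(seqlen)` is usually a table of position-times-inverse-frequency angles. The following sketch assumes the standard formulation; the function name `rotary_freqs` is hypothetical and the module's internals are not shown on this page.

```python
import torch


def rotary_freqs(seqlen: int, dim: int, theta: float = 10000.0) -> torch.Tensor:
    """Return a (seqlen, dim // 2) table of rotation angles."""
    # One inverse frequency per pair of channels.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
    positions = torch.arange(seqlen, dtype=torch.float)
    # angle[p, i] = p * inv_freq[i]
    return torch.outer(positions, inv_freq)


freqs = rotary_freqs(seqlen=4, dim=8)
```
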
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceSdpaAttention(
dim: int,
num_heads: int = 16
)

Bases: Module

SDPA variant with an additive block-diagonal mask.

head_dim = dim // num_heads
proj = nn.Linear(dim, dim)
qkv = nn.Linear(dim, dim * 3, bias=True)
nemo_automodel.components.models.llava_onevision.rice_vit.RiceSdpaAttention.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor]
) -> torch.Tensor
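
An additive block-diagonal mask holds 0 where attention is allowed and negative infinity elsewhere, so that softmax assigns zero weight across segments. The sketch below builds such a mask from `cu_seqlens` and passes it to `torch.nn.functional.scaled_dot_product_attention`. The helper name `block_diagonal_mask` is hypothetical, and the module may use a large finite negative value instead of `-inf`.

```python
import torch
import torch.nn.functional as F


def block_diagonal_mask(cu_seqlens: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    """Return a (total, total) additive mask: 0 inside segments, -inf outside."""
    total = int(cu_seqlens[-1])
    mask = torch.full((total, total), float("-inf"), dtype=dtype)
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        mask[start:end, start:end] = 0.0
    return mask


torch.manual_seed(0)
cu_seqlens = torch.tensor([0, 2, 5])  # segments of 2 and 3 tokens
mask = block_diagonal_mask(cu_seqlens)
# SDPA expects (batch, heads, tokens, head_dim); the 2-D mask broadcasts.
q, k, v = (torch.randn(1, 2, 5, 4) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Every row keeps at least its own diagonal entry unmasked, so no row of the softmax is entirely `-inf`.
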
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceTransformer(
config,
attn_implementation: str = 'eager'
)

Bases: Module

Rice ViT with per-image class-token insertion and block-diagonal attention.

Matches the HF reference: one CLS token is prepended at the start of each image segment inside the flat packed sequence, and the attention mask is built from a cu_seqlens that accounts for the extra CLS per segment.

blocks
class_embedding
class_pos_emb = nn.Parameter(torch.randn(1, head_dim // 2))
dtype
merger
patch_embed
patch_size = config.patch_size
pre_layernorm
rotary_pos_emb = RiceRotaryEmbedding(head_dim // 2)
spatial_merge_size = config.spatial_merge_size
nemo_automodel.components.models.llava_onevision.rice_vit.RiceTransformer.forward(
pixel_values: torch.Tensor,
grid_thw: torch.Tensor
) -> torch.Tensor
nemo_automodel.components.models.llava_onevision.rice_vit.RiceTransformer.rot_pos_emb(
grid_thw: torch.Tensor
) -> torch.Tensor
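
The class description of `RiceTransformer` says one CLS token is prepended to each image segment and that `cu_seqlens` accounts for it. A minimal sketch of that bookkeeping follows. It assumes each row of `grid_thw` is one image contributing `t * h * w` patch tokens as a single segment; the helper names `cu_seqlens_with_cls` and `insert_cls` are hypothetical and the real `forward` may organize this differently.

```python
import torch


def cu_seqlens_with_cls(grid_thw: torch.Tensor) -> torch.Tensor:
    """Cumulative boundaries where every segment is one token longer (the CLS)."""
    seg_lens = grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2] + 1
    return torch.cat([seg_lens.new_zeros(1), seg_lens.cumsum(0)])


def insert_cls(patch_tokens, cls_token, grid_thw):
    """Prepend cls_token to each image's run of patch tokens in the packed sequence."""
    lengths = (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()
    pieces = []
    for segment in torch.split(patch_tokens, lengths, dim=0):
        pieces.append(cls_token.view(1, -1))
        pieces.append(segment)
    return torch.cat(pieces, dim=0)


torch.manual_seed(0)
grid_thw = torch.tensor([[1, 2, 2], [1, 2, 3]])  # 4 patches and 6 patches
patch_tokens = torch.randn(10, 8)
cls_token = torch.zeros(8)
packed = insert_cls(patch_tokens, cls_token, grid_thw)
cu_seqlens = cu_seqlens_with_cls(grid_thw)
```

In this example the packed sequence grows from 10 to 12 tokens, and the boundaries shift from `[0, 4, 10]` to `[0, 5, 12]`.
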
nemo_automodel.components.models.llava_onevision.rice_vit.apply_rotary_pos_emb_vision(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor]
nemo_automodel.components.models.llava_onevision.rice_vit.rotate_half(
x: torch.Tensor
) -> torch.Tensor
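
The two helpers above carry no description on this page. The sketch below shows the conventional rotary formulation that such names usually denote, under the assumption that `q` and `k` are shaped `(tokens, num_heads, head_dim)` and `cos`/`sin` are shaped `(tokens, head_dim)`. The `_sketch` suffixes mark these as illustrative functions, not the library's implementations.

```python
import torch


def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    """Map (x1, x2) halves of the last dimension to (-x2, x1)."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_sketch(q, k, cos, sin):
    """Rotate q and k by per-token angles; cos/sin broadcast over heads."""
    cos, sin = cos.unsqueeze(-2), sin.unsqueeze(-2)
    q_rot = q * cos + rotate_half_sketch(q) * sin
    k_rot = k * cos + rotate_half_sketch(k) * sin
    return q_rot, k_rot


torch.manual_seed(0)
tokens, heads, head_dim = 5, 2, 8
q = torch.randn(tokens, heads, head_dim)
k = torch.randn(tokens, heads, head_dim)
angles = torch.randn(tokens, head_dim // 2)
# Duplicate the angles so channel i and channel i + head_dim // 2 share one angle.
emb = torch.cat((angles, angles), dim=-1)
q_rot, k_rot = apply_rotary_sketch(q, k, emb.cos(), emb.sin())
```

Since each channel pair undergoes a pure rotation, vector norms are unchanged; only relative phase between tokens is encoded.
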
nemo_automodel.components.models.llava_onevision.rice_vit._ATTENTION_CLASSES = {'eager': RiceAttention, 'sdpa': RiceSdpaAttention, 'flash_attention_2': RiceFla...