nemo_automodel.components.models.llava_onevision.rice_vit

Rice Vision Transformer for LLaVA-OneVision-1.5.

Ported from lmms-lab/LLaVA-OneVision-1.5’s modeling_llavaonevision1_5.py.

Module Contents

Classes

RiceRotaryEmbedding

RicePatchEmbed

RicePatchMerger

RiceMlp

RiceAttention

Eager block-diagonal attention over variable-length image segments.

RiceFlashAttention2

Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn).

RiceSdpaAttention

SDPA variant with an additive block-diagonal mask.

RiceBlock

RiceTransformer

Rice ViT with per-image class-token insertion and block-diagonal attention.

Functions

rotate_half

apply_rotary_pos_emb_vision

Data

_ATTENTION_CLASSES

API

nemo_automodel.components.models.llava_onevision.rice_vit.rotate_half(x: torch.Tensor) → torch.Tensor

nemo_automodel.components.models.llava_onevision.rice_vit.apply_rotary_pos_emb_vision(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
) → Tuple[torch.Tensor, torch.Tensor]
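As a rough illustration of what these two helpers compute, the sketch below follows the rotate-half convention common to Hugging Face vision rotary embeddings. That convention is an assumption, since this page does not show the function bodies. Plain Python lists stand in for tensors so the sketch has no dependencies, and the names `rotate_half_list` and `apply_rotary_list` are hypothetical.

```python
import math


def rotate_half_list(x):
    """Split the vector into halves (x1, x2) and return (-x2, x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + list(x1)


def apply_rotary_list(vec, cos, sin):
    """Rotate one query or key vector: vec * cos + rotate_half(vec) * sin."""
    rotated = rotate_half_list(vec)
    return [v * c + r * s for v, r, c, s in zip(vec, rotated, cos, sin)]


# One rotation angle per channel pair, repeated across both halves
# (assumed layout, matching the rotate-half pairing of channel i with i + half).
angles = [0.5, 1.25]
cos = [math.cos(a) for a in angles] * 2
sin = [math.sin(a) for a in angles] * 2

q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rotary_list(q, cos, sin)
```

Because each channel pair is rotated in its own 2-D plane, the rotation leaves the vector's norm unchanged, which is a quick sanity check for any rotary implementation.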
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceRotaryEmbedding(dim: int, theta: float = 10000.0)

Bases: torch.nn.Module

Initialization

forward(seqlen: int) → torch.Tensor
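The signature suggests a table of rotation angles indexed by position. A minimal sketch follows, assuming the usual inverse-frequency formulation `1 / theta ** (2i / dim)` multiplied by the position index. The page does not confirm this formula, and `rotary_frequencies` is a hypothetical pure-Python stand-in rather than the module's code.

```python
def rotary_frequencies(seqlen, dim, theta=10000.0):
    """Return a seqlen x (dim // 2) table of rotation angles."""
    # One inverse frequency per channel pair (assumed formulation).
    inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    # Outer product of positions and inverse frequencies.
    return [[pos * f for f in inv_freq] for pos in range(seqlen)]


freqs = rotary_frequencies(seqlen=4, dim=8)
```

Under this assumption, position 0 always maps to zero angles, and the first channel pair advances by exactly one radian per position.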
class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchEmbed(
    patch_size: int = 14,
    temporal_patch_size: int = 1,
    in_channels: int = 3,
    embed_dim: int = 1024,
)

Bases: torch.nn.Module

Initialization

forward(hidden_states: torch.Tensor) → torch.Tensor

class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchMerger(
    dim: int,
    context_dim: int,
    spatial_merge_size: int = 2,
    layer_norm_eps: float = 1e-05,
)

Bases: torch.nn.Module

Initialization

forward(x: torch.Tensor) → torch.Tensor

class nemo_automodel.components.models.llava_onevision.rice_vit.RiceMlp(dim: int, hidden_dim: int, hidden_act: str)

Bases: torch.nn.Module

Initialization

forward(x: torch.Tensor) → torch.Tensor

class nemo_automodel.components.models.llava_onevision.rice_vit.RiceAttention(dim: int, num_heads: int = 16)

Bases: torch.nn.Module

Eager block-diagonal attention over variable-length image segments.

Initialization

forward(
    hidden_states: torch.Tensor,
    cu_seqlens: torch.Tensor,
    position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
) → torch.Tensor
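To make "block-diagonal attention over variable-length image segments" concrete, the sketch below derives segment boundaries and a boolean attention mask from cumulative sequence offsets. It uses plain Python lists instead of tensors, and `segment_spans` and `block_diagonal_mask` are hypothetical helper names. How `RiceAttention` builds its mask internally is not shown on this page.

```python
def segment_spans(cu_seqlens):
    """Turn cumulative offsets [0, n1, n1 + n2, ...] into (start, end) pairs."""
    return list(zip(cu_seqlens[:-1], cu_seqlens[1:]))


def block_diagonal_mask(cu_seqlens):
    """mask[i][j] is True only when tokens i and j share a segment."""
    total = cu_seqlens[-1]
    mask = [[False] * total for _ in range(total)]
    for start, end in segment_spans(cu_seqlens):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


# Two packed segments of lengths 2 and 3.
mask = block_diagonal_mask([0, 2, 5])
```

With offsets `[0, 2, 5]`, tokens 0 and 1 attend only to each other, and tokens 2 through 4 form a second independent block.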
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceFlashAttention2(dim: int, num_heads: int = 16)

Bases: torch.nn.Module

Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn).

Initialization

forward(
    hidden_states: torch.Tensor,
    cu_seqlens: torch.Tensor,
    position_embeddings: Tuple[torch.Tensor, torch.Tensor],
) → torch.Tensor
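`flash_attn_varlen_func` takes the cumulative offsets together with a maximum segment length instead of a dense mask. The sketch below shows how those quantities relate to `cu_seqlens`, using plain Python lists. The helper name `varlen_args` and its dictionary keys are hypothetical, and the way `RiceFlashAttention2` derives these values is an assumption.

```python
def varlen_args(cu_seqlens):
    """Derive per-segment lengths and the longest segment from offsets."""
    lengths = [end - start for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]
    return {
        "seqlens": lengths,
        "max_seqlen": max(lengths),
        "total_tokens": cu_seqlens[-1],
    }


args = varlen_args([0, 5, 12])
```

No quadratic mask is materialized on this path: the kernel only needs the offsets and the longest segment length.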
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceSdpaAttention(dim: int, num_heads: int = 16)

Bases: torch.nn.Module

SDPA variant with an additive block-diagonal mask.

Initialization

forward(
    hidden_states: torch.Tensor,
    cu_seqlens: torch.Tensor,
    position_embeddings: Tuple[torch.Tensor, torch.Tensor],
) → torch.Tensor
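An additive mask is added to the attention scores before the softmax: 0 keeps a position and negative infinity removes it. The following is a dependency-free sketch of that idea, with hypothetical helper names `additive_block_mask` and `masked_softmax`. It illustrates the masking scheme named in the class description, not the module's actual code.

```python
import math


def additive_block_mask(cu_seqlens):
    """0.0 inside each segment's block, -inf everywhere else."""
    total = cu_seqlens[-1]
    mask = [[float("-inf")] * total for _ in range(total)]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            mask[i][start:end] = [0.0] * (end - start)
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over scores + mask; masked positions get weight 0."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    peak = max(shifted)  # finite, since every token can see itself
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


add_mask = additive_block_mask([0, 2, 5])
weights = masked_softmax([0.0] * 5, add_mask[0])
```

For token 0 in the first segment, all attention weight stays within positions 0 and 1, and the three tokens of the second segment receive exactly zero.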
nemo_automodel.components.models.llava_onevision.rice_vit._ATTENTION_CLASSES

None

class nemo_automodel.components.models.llava_onevision.rice_vit.RiceBlock(config, attn_implementation: str = 'eager')

Bases: torch.nn.Module

Initialization

forward(
    hidden_states: torch.Tensor,
    cu_seqlens: torch.Tensor,
    position_embeddings: Tuple[torch.Tensor, torch.Tensor],
) → torch.Tensor

class nemo_automodel.components.models.llava_onevision.rice_vit.RiceTransformer(config, attn_implementation: str = 'eager')

Bases: torch.nn.Module

Rice ViT with per-image class-token insertion and block-diagonal attention.

Matches the Hugging Face (HF) reference implementation: one CLS token is prepended to each image segment inside the flat packed sequence, and the block-diagonal attention mask is built from cu_seqlens offsets that account for the extra CLS token per segment.
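The bookkeeping described above can be sketched with plain Python lists. The example assumes the number of patch tokens per segment is already known; how the module derives those counts from `grid_thw` is not shown on this page. The helper names `cu_seqlens_with_cls` and `insert_cls` are hypothetical.

```python
from itertools import accumulate


def cu_seqlens_with_cls(patch_counts):
    """Cumulative offsets where every segment is one token longer (its CLS)."""
    lengths = [n + 1 for n in patch_counts]
    return [0] + list(accumulate(lengths))


def insert_cls(tokens, patch_counts, cls="CLS"):
    """Prepend one CLS marker to each segment of a flat packed sequence."""
    packed, start = [], 0
    for n in patch_counts:
        packed.append(cls)
        packed.extend(tokens[start:start + n])
        start += n
    return packed


patch_counts = [2, 3]                      # two segments of 2 and 3 patches
offsets = cu_seqlens_with_cls(patch_counts)
packed = insert_cls(list(range(5)), patch_counts)
```

The offsets grow by one extra token per segment, so the final offset equals the number of patch tokens plus the number of segments, which is also the length of the packed sequence.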

Initialization

property dtype: torch.dtype

rot_pos_emb(grid_thw: torch.Tensor) → torch.Tensor

forward(
    pixel_values: torch.Tensor,
    grid_thw: torch.Tensor,
) → torch.Tensor