nemo_automodel.components.models.llava_onevision.rice_vit#

Rice Vision Transformer for LLaVA-OneVision-1.5.

Module Contents#

Classes#

RiceRotaryEmbedding

2D Rotary Position Embedding for Rice ViT.

RicePatchEmbed

Patch embedding layer for Rice ViT.

RicePatchMerger

Merges spatial patches and projects to text hidden size.

RiceMlp

MLP block for Rice ViT.

RiceAttention

Multi-head attention with block-diagonal mask for variable-length images.

RiceBlock

Transformer block for Rice ViT.

Functions#

rotate_half

Rotates half the hidden dims of the input.

apply_rotary_pos_emb_vision

Apply rotary positional embeddings to vision attention.

API#

nemo_automodel.components.models.llava_onevision.rice_vit.rotate_half(x: torch.Tensor) → torch.Tensor#

Rotates half the hidden dims of the input.
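The name and docstring match the standard rotary-embedding helper, which splits the last dimension in half and swaps the halves with a sign flip. A minimal sketch of that convention (the module's exact implementation may differ):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim into (x1, x2) and return (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```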

nemo_automodel.components.models.llava_onevision.rice_vit.apply_rotary_pos_emb_vision(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) → Tuple[torch.Tensor, torch.Tensor]#

Apply rotary positional embeddings to vision attention.
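In the common formulation, rotary embedding rotates each (q, k) pair elementwise as x' = x·cos + rotate_half(x)·sin. A self-contained sketch under that assumption (broadcasting of `cos`/`sin` against the head dimension is left implicit):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_vision(q, k, cos, sin):
    # Elementwise rotation; cos/sin must broadcast against q and k.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```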

class nemo_automodel.components.models.llava_onevision.rice_vit.RiceRotaryEmbedding(dim: int, theta: float = 10000.0)#

Bases: torch.nn.Module

2D Rotary Position Embedding for Rice ViT.

Initialization

forward(seqlen: int) → torch.Tensor#
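Given the `(dim, theta)` constructor and the `forward(seqlen)` signature, this likely builds a rotary frequency table: an outer product of positions with inverse frequencies derived from `theta`. A hypothetical sketch (the class name is placeholder; details may differ from the real `RiceRotaryEmbedding`):

```python
import torch

class RotaryEmbeddingSketch(torch.nn.Module):
    """Hypothetical rotary frequency table of shape (seqlen, dim // 2)."""

    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        # inv_freq[i] = theta ** (-2i / dim), the standard rotary schedule.
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, seqlen: int) -> torch.Tensor:
        # Outer product of positions with inverse frequencies.
        seq = torch.arange(seqlen, dtype=self.inv_freq.dtype)
        return torch.outer(seq, self.inv_freq)
```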
class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchEmbed(
patch_size: int = 14,
temporal_patch_size: int = 1,
in_channels: int = 3,
embed_dim: int = 1152,
)#

Bases: torch.nn.Module

Patch embedding layer for Rice ViT.

Initialization

forward(hidden_states: torch.Tensor) → torch.Tensor#
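Patch embedding in ViTs is typically a non-overlapping convolution with kernel and stride equal to `patch_size`. A simplified 2D sketch under that assumption (the real module also takes `temporal_patch_size`, which this hypothetical version omits):

```python
import torch

class PatchEmbedSketch(torch.nn.Module):
    """Hypothetical conv patchifier: (B, C, H, W) -> (B * num_patches, embed_dim)."""

    def __init__(self, patch_size: int = 14, in_channels: int = 3, embed_dim: int = 1152):
        super().__init__()
        self.proj = torch.nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        x = self.proj(pixel_values)              # (B, E, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, E)
        return x.reshape(-1, x.shape[-1])        # flatten batch and patches
```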
class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchMerger(
dim: int,
context_dim: int,
spatial_merge_size: int = 2,
layer_norm_eps: float = 1e-05,
)#

Bases: torch.nn.Module

Merges spatial patches and projects to text hidden size.

Initialization

forward(x: torch.Tensor) → torch.Tensor#
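The constructor arguments suggest the Qwen2-VL-style merger: concatenate `spatial_merge_size ** 2` adjacent patch embeddings, then project to the text hidden size `dim` with a small MLP. A hypothetical sketch on those assumptions:

```python
import torch

class PatchMergerSketch(torch.nn.Module):
    """Hypothetical merger: groups of spatial_merge_size**2 patches -> one text-dim vector."""

    def __init__(self, dim: int, context_dim: int,
                 spatial_merge_size: int = 2, layer_norm_eps: float = 1e-5):
        super().__init__()
        self.hidden_size = context_dim * (spatial_merge_size ** 2)
        self.ln_q = torch.nn.LayerNorm(context_dim, eps=layer_norm_eps)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(self.hidden_size, self.hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize, then concatenate each group of patches before projecting.
        x = self.ln_q(x).view(-1, self.hidden_size)
        return self.mlp(x)
```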
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceMlp(dim: int, hidden_dim: int, hidden_act: str)#

Bases: torch.nn.Module

MLP block for Rice ViT.

Initialization

forward(x: torch.Tensor) → torch.Tensor#
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceAttention(dim: int, num_heads: int = 16)#

Bases: torch.nn.Module

Multi-head attention with block-diagonal mask for variable-length images.

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
rotary_pos_emb: Optional[torch.Tensor] = None,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
) → torch.Tensor#
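The `cu_seqlens` argument (cumulative sequence lengths) is the usual way to pack variable-length images into one flat token sequence; the block-diagonal mask then restricts attention to tokens of the same image. A hypothetical helper showing how such a mask can be built from `cu_seqlens` (the real module may use a fused varlen kernel instead of a dense mask):

```python
import torch

def block_diagonal_mask(cu_seqlens: torch.Tensor, seq_len: int) -> torch.Tensor:
    """True where attention is allowed, i.e. both tokens belong to the same image."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        mask[start:end, start:end] = True
    return mask
```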
class nemo_automodel.components.models.llava_onevision.rice_vit.RiceBlock(config)#

Bases: torch.nn.Module

Transformer block for Rice ViT.

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
rotary_pos_emb: Optional[torch.Tensor] = None,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
) → torch.Tensor#