nemo_automodel.components.models.llava_onevision.model


LLaVA-OneVision-1.5 model implementation.

Matches the layout of lmms-lab/LLaVA-OneVision-1.5-*-{Base,Instruct} so that HF safetensors load into this module tree via LlavaOneVisionStateDictAdapter with only regex-renames (no tensor transforms).
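A rename-only adapter of this kind can be pictured as below. This is a minimal sketch, not the actual LlavaOneVisionStateDictAdapter: the rename rules, the key names, and the helper rename_keys are hypothetical, chosen only to show that keys change while the stored values pass through untouched.

```python
import re

# Hypothetical (pattern, replacement) rules; the real adapter's rules
# are not listed on this page.
RENAME_RULES = [
    (re.compile(r"^visual\."), "model.visual."),
    (re.compile(r"^language_model\."), "model.language_model."),
]


def rename_keys(state_dict, rules=RENAME_RULES):
    """Return a new dict with renamed keys; values are passed through as-is."""
    renamed = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, replacement in rules:
            new_key = pattern.sub(replacement, new_key)
        renamed[new_key] = tensor  # no tensor transform, only the key changes
    return renamed


# Placeholder strings stand in for tensors.
example = {"visual.blocks.0.attn.qkv.weight": "W0", "lm_head.weight": "W1"}
converted = rename_keys(example)
```

Keys that match no rule (here lm_head.weight) are kept unchanged.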

Module Contents

Classes

| Name | Description |
| --- | --- |
| LLaVAOneVision1_5_ForConditionalGeneration | LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text). |
| LLaVAOneVision1_5_Model | Combined vision + language backbone. Returns last_hidden_state. |
| Llavaonevision1_5Config | Top-level config for LLaVA-OneVision-1.5. |
| RiceConfig | Configuration for the Rice ViT vision tower. |

Functions

| Name | Description |
| --- | --- |
| _build_text_config | Coerce a text_config dict from HF (or user) into a Qwen3Config. |
| _coerce_text_config | Accept a raw HF remote-code text config and return a Qwen3Config. |
| _coerce_vision_config | - |

Data

LOGGER

API

class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration(
config,
attn_implementation: typing.Optional[str] = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, Module

LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text).

image_token_id = getattr(config, 'image_token_id', 151655)
lm_head
model
state_dict_adapter = LlavaOneVisionStateDictAdapter(config)
video_token_id = getattr(config, 'video_token_id', 151656)
vocab_size = self.model.text_config.vocab_size
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.forward(
input_ids: typing.Optional[torch.LongTensor] = None,
attention_mask: typing.Optional[torch.Tensor] = None,
position_ids: typing.Optional[torch.LongTensor] = None,
past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
inputs_embeds: typing.Optional[torch.FloatTensor] = None,
labels: typing.Optional[torch.LongTensor] = None,
use_cache: typing.Optional[bool] = None,
pixel_values: typing.Optional[torch.FloatTensor] = None,
pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
image_grid_thw: typing.Optional[torch.LongTensor] = None,
video_grid_thw: typing.Optional[torch.LongTensor] = None,
output_hidden_states: typing.Optional[bool] = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
kwargs = {}
) -> typing.Union[typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]
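The logits_to_keep parameter shares its name and default with the Hugging Face causal LM convention, where the integer 0 means "compute logits for every position" and a positive integer N means "only the last N positions". Whether this class follows that convention exactly is an assumption; the sketch below uses plain lists in place of tensors, and select_positions is a hypothetical helper.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the sequence positions that would be passed to the LM head.

    hidden_states: list of per-position items (stand-in for the sequence
    dimension of a tensor). logits_to_keep: an int or a list of indices.
    """
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return hidden_states[slice(-logits_to_keep, None)]
    # Assumed non-int case: explicit indices along the sequence dimension.
    return [hidden_states[i] for i in logits_to_keep]


sequence = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(sequence)   # every position
last_only = select_positions(sequence, 1)    # only the final position
```

Keeping only the last position is the usual choice during generation, since earlier logits are not needed to pick the next token.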
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.from_config(
config,
kwargs = {}
)
classmethod
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
kwargs = {}
)
classmethod
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.get_input_embeddings()
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.get_output_embeddings()
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.set_input_embeddings(
value
)
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.set_output_embeddings(
new_embeddings
)
class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model(
config: nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config,
attn_implementation: str = 'eager'
)

Bases: Module

Combined vision + language backbone. Returns last_hidden_state.

language_model = Qwen3Model(self.text_config)
text_config = _coerce_text_config(config.text_config)
vision_config = _coerce_vision_config(config.vision_config)
visual
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.forward(
input_ids: typing.Optional[torch.LongTensor] = None,
attention_mask: typing.Optional[torch.Tensor] = None,
position_ids: typing.Optional[torch.LongTensor] = None,
past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
inputs_embeds: typing.Optional[torch.FloatTensor] = None,
pixel_values: typing.Optional[torch.FloatTensor] = None,
pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
image_grid_thw: typing.Optional[torch.LongTensor] = None,
video_grid_thw: typing.Optional[torch.LongTensor] = None,
use_cache: typing.Optional[bool] = None,
kwargs = {}
)
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.get_image_features(
pixel_values: torch.FloatTensor,
image_grid_thw: torch.LongTensor
) -> torch.Tensor
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.get_input_embeddings()
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.set_input_embeddings(
value
)
class nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config(
text_config: typing.Optional[typing.Union[typing.Dict, transformers.configuration_utils.PretrainedConfig]] = None,
vision_config: typing.Optional[typing.Union[typing.Dict, nemo_automodel.components.models.llava_onevision.model.RiceConfig]] = None,
image_token_id: int = 151655,
video_token_id: int = 151656,
vision_start_token_id: int = 151652,
vision_end_token_id: int = 151653,
vocab_size: int = 152064,
architectures: typing.Optional[typing.List[str]] = None,
kwargs = {}
)

Bases: PretrainedConfig

Top-level config for LLaVA-OneVision-1.5.

model_type matches the on-hub value exactly, so once this class is registered, AutoConfig.from_pretrained resolves to it without trust_remote_code.

model_type = 'llavaonevision1_5'
sub_configs = {'vision_config': RiceConfig}
text_config = _build_text_config(text_config)
vision_config = RiceConfig(**vision_config)
nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config.to_dict() -> typing.Dict[str, typing.Any]
class nemo_automodel.components.models.llava_onevision.model.RiceConfig(
depth: int = 24,
embed_dim: int = 1024,
hidden_size: int = 1024,
hidden_act: str = 'gelu',
intermediate_size: int = 4096,
num_heads: int = 16,
in_channels: int = 3,
patch_size: int = 14,
spatial_merge_size: int = 2,
temporal_patch_size: int = 1,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-05,
text_hidden_size: int = 2560,
kwargs = {}
)

Bases: PretrainedConfig

Configuration for the Rice ViT vision tower.

base_config_key = 'vision_config'
model_type = 'rice_vit'
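With the RiceConfig defaults (patch_size = 14, spatial_merge_size = 2), the number of visual tokens an image contributes can be estimated as follows. The sketch assumes a Qwen2-VL-style convention in which image_grid_thw holds (t, h, w) counts in patches and each spatial_merge_size x spatial_merge_size group of patches merges into one token. The page does not confirm that convention, and both helper names are hypothetical.

```python
PATCH_SIZE = 14          # RiceConfig default
SPATIAL_MERGE_SIZE = 2   # RiceConfig default


def grid_thw_for_image(height, width, patch_size=PATCH_SIZE):
    """Patch grid (t, h, w) for one still image whose sides divide evenly."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (1, height // patch_size, width // patch_size)


def merged_token_count(grid_thw, merge=SPATIAL_MERGE_SIZE):
    """Tokens left after merging each merge x merge block of patches."""
    t, h, w = grid_thw
    return (t * h * w) // (merge * merge)


grid = grid_thw_for_image(448, 448)   # 448 / 14 = 32 patches per side
tokens = merged_token_count(grid)     # 32 * 32 / 4 = 256
```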
nemo_automodel.components.models.llava_onevision.model._build_text_config(
data: typing.Dict[str, typing.Any]
) -> transformers.configuration_utils.PretrainedConfig

Coerce a text_config dict from HF (or user) into a Qwen3Config.

LLaVA-OV-1.5’s text backbone is Qwen3 (q/k norm, GQA, standard SiLU MLP). The on-hub model_type is LLaVAOneVision1_5_text; it is dropped so that Qwen3Config does not reject the kwargs.
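Dropping model_type amounts to filtering the dict before it reaches the config constructor. A minimal sketch, assuming only what the description states: the helper name strip_model_type and the sample field values are hypothetical, and the Qwen3Config call is left as a comment so the snippet runs without transformers installed.

```python
def strip_model_type(data):
    """Return a copy of a text_config dict without the 'model_type' key."""
    cleaned = dict(data)             # do not mutate the caller's dict
    cleaned.pop("model_type", None)  # tolerate dicts that lack the key
    return cleaned


raw_text_config = {
    "model_type": "LLaVAOneVision1_5_text",  # on-hub value named in the description
    "hidden_size": 2560,                      # illustrative values only
    "num_hidden_layers": 36,
}
qwen3_kwargs = strip_model_type(raw_text_config)
# qwen3_kwargs could then be passed on, e.g. Qwen3Config(**qwen3_kwargs).
```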

nemo_automodel.components.models.llava_onevision.model._coerce_text_config(
tc: typing.Any
) -> transformers.configuration_utils.PretrainedConfig

Accept a raw HF remote-code text config and return a Qwen3Config.

The constructor path for NeMo custom models is cls(hf_config), where hf_config may be the remote-code Llavaonevision1_5Config whose text_config is a LLaVAOneVision1_5_TextConfig instance. This function normalizes it to a Qwen3Config so the inner Qwen3Model gets fields it understands.
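One way to picture this normalization is a small dispatcher that turns whatever it receives into a plain dict before the target config is built. This is a hedged sketch, not the module's code: FakeRemoteTextConfig and to_plain_dict are hypothetical stand-ins, and the sketch assumes the remote-code config exposes to_dict() as PretrainedConfig subclasses do.

```python
class FakeRemoteTextConfig:
    """Hypothetical stand-in for a remote-code text config instance."""

    def __init__(self, **fields):
        self._fields = fields

    def to_dict(self):
        return dict(self._fields)


def to_plain_dict(tc):
    """Normalize None, a dict, or a config-like object to a plain dict."""
    if tc is None:
        return {}
    if isinstance(tc, dict):
        return dict(tc)
    if hasattr(tc, "to_dict"):
        return tc.to_dict()
    raise TypeError(f"unsupported text config type: {type(tc).__name__}")


remote = FakeRemoteTextConfig(model_type="LLaVAOneVision1_5_text", hidden_size=2560)
fields = to_plain_dict(remote)
# A Qwen3Config could then be built from these fields once model_type is dropped.
```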

nemo_automodel.components.models.llava_onevision.model._coerce_vision_config(
vc: typing.Any
) -> nemo_automodel.components.models.llava_onevision.model.RiceConfig
nemo_automodel.components.models.llava_onevision.model.LOGGER = logging.getLogger(__name__)