# nemo_automodel.components.models.llava_onevision.model

LLaVA-OneVision-1.5 model implementation.

Matches the layout of the `lmms-lab/LLaVA-OneVision-1.5-*-{Base,Instruct}` checkpoints so that
HF safetensors load into this module tree via `LlavaOneVisionStateDictAdapter`
using regex key renames only (no tensor transforms).
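
The "rename only" loading path can be pictured with a minimal sketch. The patterns and key names below are hypothetical and chosen for illustration; the actual rules used by `LlavaOneVisionStateDictAdapter` are not listed on this page. The point is that only keys change, while every value is passed through untouched.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical (pattern, replacement) rules; the real adapter's rules may differ.
RENAME_RULES: List[Tuple[str, str]] = [
    (r"^model\.visual\.", "model.vision_tower."),
    (r"^model\.language_model\.", "model.text_model."),
]


def rename_keys(state_dict: Dict[str, object]) -> Dict[str, object]:
    """Return a new dict with renamed keys; values are never modified."""
    renamed: Dict[str, object] = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in RENAME_RULES:
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = value
    return renamed


# Strings stand in for tensors so the sketch runs without torch.
checkpoint = {
    "model.visual.blocks.0.attn.qkv.weight": "tensor-a",
    "model.language_model.layers.0.mlp.up_proj.weight": "tensor-b",
    "lm_head.weight": "tensor-c",
}
converted = rename_keys(checkpoint)
```

Keys that match no rule, such as `lm_head.weight` here, keep their original name.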

## Module Contents

### Classes

| Name                                                                                                                                               | Description                                                             |
| -------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| [`LLaVAOneVision1_5_ForConditionalGeneration`](#nemo_automodel-components-models-llava_onevision-model-LLaVAOneVision1_5_ForConditionalGeneration) | LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text). |
| [`LLaVAOneVision1_5_Model`](#nemo_automodel-components-models-llava_onevision-model-LLaVAOneVision1_5_Model)                                       | Combined vision + language backbone. Returns last\_hidden\_state.       |
| [`Llavaonevision1_5Config`](#nemo_automodel-components-models-llava_onevision-model-Llavaonevision1_5Config)                                       | Top-level config for LLaVA-OneVision-1.5.                               |
| [`RiceConfig`](#nemo_automodel-components-models-llava_onevision-model-RiceConfig)                                                                 | Configuration for the Rice ViT vision tower.                            |

### Functions

| Name                                                                                                     | Description                                                       |
| -------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| [`_build_text_config`](#nemo_automodel-components-models-llava_onevision-model-_build_text_config)       | Coerce a text\_config dict from HF (or user) into a Qwen3Config.  |
| [`_coerce_text_config`](#nemo_automodel-components-models-llava_onevision-model-_coerce_text_config)     | Accept a raw HF remote-code text config and return a Qwen3Config. |
| [`_coerce_vision_config`](#nemo_automodel-components-models-llava_onevision-model-_coerce_vision_config) | Coerce a vision config (dict or object) into a RiceConfig.        |

### Data

[`LOGGER`](#nemo_automodel-components-models-llava_onevision-model-LOGGER)

### API

```python
class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration(
    config,
    attn_implementation: typing.Optional[str] = None,
    kwargs = {}
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`

LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text).

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    pixel_values: typing.Optional[torch.FloatTensor] = None,
    pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
    image_grid_thw: typing.Optional[torch.LongTensor] = None,
    video_grid_thw: typing.Optional[torch.LongTensor] = None,
    output_hidden_states: typing.Optional[bool] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    kwargs = {}
) -> typing.Union[typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.from_config(
    config,
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.get_input_embeddings()
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.get_output_embeddings()
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration.set_output_embeddings(
    new_embeddings
)
```

```python
class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model(
    config: nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config,
    attn_implementation: str = 'eager'
)
```

**Bases:** `Module`

Combined vision + language backbone. Returns last\_hidden\_state.

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    pixel_values: typing.Optional[torch.FloatTensor] = None,
    pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
    image_grid_thw: typing.Optional[torch.LongTensor] = None,
    video_grid_thw: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    kwargs = {}
)
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.get_image_features(
    pixel_values: torch.FloatTensor,
    image_grid_thw: torch.LongTensor
) -> torch.Tensor
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.get_input_embeddings()
```

```python
nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model.set_input_embeddings(
    value
)
```

```python
class nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config(
    text_config: typing.Optional[typing.Union[typing.Dict, transformers.configuration_utils.PretrainedConfig]] = None,
    vision_config: typing.Optional[typing.Union[typing.Dict, nemo_automodel.components.models.llava_onevision.model.RiceConfig]] = None,
    image_token_id: int = 151655,
    video_token_id: int = 151656,
    vision_start_token_id: int = 151652,
    vision_end_token_id: int = 151653,
    vocab_size: int = 152064,
    architectures: typing.Optional[typing.List[str]] = None,
    kwargs = {}
)
```

**Bases:** `PretrainedConfig`

Top-level config for LLaVA-OneVision-1.5.

`model_type` matches the on-hub value exactly so `AutoConfig.from_pretrained`
resolves to this class without `trust_remote_code` once registered.
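
Why an exact `model_type` match matters can be shown with a simplified stand-in registry. This is not the `transformers` implementation: `BaseConfig`, `CONFIG_REGISTRY`, `register`, `load_config`, and the `demo_vlm` type string are all hypothetical names used only to illustrate lookup by `model_type`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Type


class BaseConfig:
    """Hypothetical stand-in for a config base class."""

    model_type = ""

    def __init__(self, **kwargs: Any) -> None:
        self.values = kwargs


# Maps the model_type string stored in config.json to a config class.
CONFIG_REGISTRY: Dict[str, Type[BaseConfig]] = {}


def register(config_cls: Type[BaseConfig]) -> Type[BaseConfig]:
    """Register a config class under its own model_type."""
    CONFIG_REGISTRY[config_cls.model_type] = config_cls
    return config_cls


@register
class DemoVLMConfig(BaseConfig):
    model_type = "demo_vlm"  # must equal the string stored in config.json


def load_config(config_path: Path) -> BaseConfig:
    """Resolve the config class from the model_type found in the file."""
    data = json.loads(config_path.read_text())
    model_type = data.pop("model_type")
    if model_type not in CONFIG_REGISTRY:
        raise ValueError(f"no config class registered for {model_type!r}")
    return CONFIG_REGISTRY[model_type](**data)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({"model_type": "demo_vlm", "vocab_size": 152064}))
    loaded = load_config(path)
```

If the registered string differed from the stored one by even a character, the lookup would fail and a fallback such as remote code would be needed.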

```python
nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config.to_dict() -> typing.Dict[str, typing.Any]
```

```python
class nemo_automodel.components.models.llava_onevision.model.RiceConfig(
    depth: int = 24,
    embed_dim: int = 1024,
    hidden_size: int = 1024,
    hidden_act: str = 'gelu',
    intermediate_size: int = 4096,
    num_heads: int = 16,
    in_channels: int = 3,
    patch_size: int = 14,
    spatial_merge_size: int = 2,
    temporal_patch_size: int = 1,
    initializer_range: float = 0.02,
    layer_norm_eps: float = 1e-05,
    text_hidden_size: int = 2560,
    kwargs = {}
)
```

**Bases:** `PretrainedConfig`

Configuration for the Rice ViT vision tower.
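
To give a feel for the default `patch_size`, `temporal_patch_size`, and `spatial_merge_size` values, the sketch below computes a patch grid and the number of vision tokens after merging. It assumes the Qwen2-VL-style convention in which the grid is `(t, h, w)` in patch units and merging groups `spatial_merge_size × spatial_merge_size` neighboring patches into one token; this page does not confirm that the Rice ViT follows it. `patch_grid` and `merged_token_count` are hypothetical helpers, not part of the module.

```python
from typing import Tuple


def patch_grid(
    height: int,
    width: int,
    frames: int = 1,
    patch_size: int = 14,
    temporal_patch_size: int = 1,
) -> Tuple[int, int, int]:
    """Return the (t, h, w) grid in patch units (assumed convention)."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    if frames % temporal_patch_size:
        raise ValueError("frames must be a multiple of temporal_patch_size")
    return frames // temporal_patch_size, height // patch_size, width // patch_size


def merged_token_count(grid_thw: Tuple[int, int, int], spatial_merge_size: int = 2) -> int:
    """Count tokens after merging spatial_merge_size x spatial_merge_size patches."""
    t, h, w = grid_thw
    if h % spatial_merge_size or w % spatial_merge_size:
        raise ValueError("grid height and width must be multiples of spatial_merge_size")
    return t * (h // spatial_merge_size) * (w // spatial_merge_size)


grid = patch_grid(448, 448)        # (1, 32, 32): 448 / 14 = 32 patches per side
tokens = merged_token_count(grid)  # 256: 1024 patches merged 2x2
```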

```python
nemo_automodel.components.models.llava_onevision.model._build_text_config(
    data: typing.Dict[str, typing.Any]
) -> transformers.configuration_utils.PretrainedConfig
```

Coerce a text\_config dict from HF (or user) into a Qwen3Config.

LLaVA-OV-1.5's text backbone is Qwen3 (q/k norm, GQA, standard SiLU MLP).
On-hub `model_type` is `LLaVAOneVision1_5_text`; we drop it so Qwen3Config
doesn't reject the kwargs.
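
The key-dropping step described above amounts to the following sketch. `strip_model_type` is a hypothetical helper and the field values are illustrative; the real function goes on to build a `Qwen3Config` from the cleaned dict, which is omitted here.

```python
from typing import Any, Dict


def strip_model_type(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of the text config dict without 'model_type'."""
    cleaned = dict(data)  # copy so the caller's dict is left unchanged
    cleaned.pop("model_type", None)  # tolerate dicts that never had the key
    return cleaned


# Illustrative values only; not taken from a real checkpoint.
raw_text_config = {
    "model_type": "LLaVAOneVision1_5_text",
    "hidden_size": 2560,
    "rms_norm_eps": 1e-6,
}
qwen3_kwargs = strip_model_type(raw_text_config)
```

The resulting `qwen3_kwargs` would then be passed as keyword arguments to the config constructor.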

```python
nemo_automodel.components.models.llava_onevision.model._coerce_text_config(
    tc: typing.Any
) -> transformers.configuration_utils.PretrainedConfig
```

Accept a raw HF remote-code text config and return a Qwen3Config.

The constructor path for NeMo custom models is `cls(hf_config)` where
`hf_config` may be the remote-code `Llavaonevision1_5Config` whose
`text_config` is a `LLaVAOneVision1_5_TextConfig` instance. Normalize
to Qwen3Config so the inner `Qwen3Model` gets fields it understands.
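
One way to picture this normalization is a small type dispatch, sketched here with stand-in classes so it runs without `transformers`. `StandInTextConfig` plays the role of `Qwen3Config`, `RemoteTextConfig` plays the role of the remote-code text config, and `coerce_text_config` is a hypothetical simplification; the real function may handle more cases.

```python
from typing import Any, Dict


class StandInTextConfig:
    """Hypothetical stand-in for Qwen3Config."""

    def __init__(self, hidden_size: int = 2560, **kwargs: Any) -> None:
        self.hidden_size = hidden_size
        self.extra = kwargs


class RemoteTextConfig:
    """Hypothetical stand-in for a remote-code text config instance."""

    model_type = "LLaVAOneVision1_5_text"

    def __init__(self, hidden_size: int) -> None:
        self.hidden_size = hidden_size

    def to_dict(self) -> Dict[str, Any]:
        return {"model_type": self.model_type, "hidden_size": self.hidden_size}


def coerce_text_config(tc: Any) -> StandInTextConfig:
    """Normalize a dict or config-like object into the target config type."""
    if isinstance(tc, StandInTextConfig):
        return tc  # already the target type, nothing to do
    if isinstance(tc, dict):
        data = dict(tc)
    elif hasattr(tc, "to_dict"):
        data = dict(tc.to_dict())
    else:
        raise TypeError(f"unsupported text config type: {type(tc).__name__}")
    data.pop("model_type", None)  # the target type does not expect this key
    return StandInTextConfig(**data)


from_object = coerce_text_config(RemoteTextConfig(hidden_size=1024))
from_dict = coerce_text_config({"model_type": "LLaVAOneVision1_5_text", "hidden_size": 512})
```

Returning an already-normalized config unchanged keeps the function safe to call more than once on the same input.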

```python
nemo_automodel.components.models.llava_onevision.model._coerce_vision_config(
    vc: typing.Any
) -> nemo_automodel.components.models.llava_onevision.model.RiceConfig
```

```python
nemo_automodel.components.models.llava_onevision.model.LOGGER = logging.getLogger(__name__)
```