> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.datasets.vlm.fake_image

Fake image injection helpers for FSDP / DeepSpeed Zero3.

When a batch contains no images or videos, the visual encoder is not called
during the model forward pass.  In FSDP / DeepSpeed Zero3 every parameter
must participate in the collective all-gather / reduce-scatter; skipping the
visual encoder causes the training to hang.

The fix mirrors LLaMA-Factory's approach: inject a tiny (56x56) white image
into pure-text samples.  The corresponding vision tokens get
`attention_mask = 0` so they are invisible to attention and
`labels = -100` (automatic, because the fake image lives in a *user*
message, never an assistant turn).  This guarantees model correctness while
keeping the visual encoder active.

## Module Contents

### Functions

| Name                                                                                                                            | Description                                                                           |
| ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| [`_batch_has_media`](#nemo_automodel-components-datasets-vlm-fake_image-_batch_has_media)                                       | Return True if any conversation in *conversations* contains an image or video.        |
| [`_conversation_has_media`](#nemo_automodel-components-datasets-vlm-fake_image-_conversation_has_media)                         | Return True if *conversation* (a single list of messages) contains an image or video. |
| [`_get_vision_token_ids`](#nemo_automodel-components-datasets-vlm-fake_image-_get_vision_token_ids)                             | Collect vision token IDs from a processor / tokenizer / config.                       |
| [`_scan_attrs`](#nemo_automodel-components-datasets-vlm-fake_image-_scan_attrs)                                                 | Yield integer token IDs found via `getattr` on *source*.                              |
| [`_warn_no_vision_tokens`](#nemo_automodel-components-datasets-vlm-fake_image-_warn_no_vision_tokens)                           | Log a one-time warning when a processor exposes no recognizable vision tokens.        |
| [`inject_fake_image_into_conversation`](#nemo_automodel-components-datasets-vlm-fake_image-inject_fake_image_into_conversation) | Inject a fake image into a single conversation's first user message.                  |
| [`mask_fake_vision_tokens_batch`](#nemo_automodel-components-datasets-vlm-fake_image-mask_fake_vision_tokens_batch)             | Mask vision tokens in specified batch samples (2D tensors).                           |
| [`mask_fake_vision_tokens_single`](#nemo_automodel-components-datasets-vlm-fake_image-mask_fake_vision_tokens_single)           | Mask vision tokens in a single pre-tokenized sample (1D tensors).                     |

### Data

[`_FAKE_IMAGE`](#nemo_automodel-components-datasets-vlm-fake_image-_FAKE_IMAGE)

[`_VISION_TOKEN_ID_ATTRS`](#nemo_automodel-components-datasets-vlm-fake_image-_VISION_TOKEN_ID_ATTRS)

[`_VISION_TOKEN_KEYWORDS`](#nemo_automodel-components-datasets-vlm-fake_image-_VISION_TOKEN_KEYWORDS)

[`_VISION_TOKEN_STRINGS`](#nemo_automodel-components-datasets-vlm-fake_image-_VISION_TOKEN_STRINGS)

[`_warned_unknown_processors`](#nemo_automodel-components-datasets-vlm-fake_image-_warned_unknown_processors)

[`logger`](#nemo_automodel-components-datasets-vlm-fake_image-logger)

### API

```python
nemo_automodel.components.datasets.vlm.fake_image._batch_has_media(
    conversations
)
```

Return True if any conversation in *conversations* contains an image or video.

```python
nemo_automodel.components.datasets.vlm.fake_image._conversation_has_media(
    conversation
)
```

Return True if *conversation* (a single list of messages) contains an image or video.

```python
nemo_automodel.components.datasets.vlm.fake_image._get_vision_token_ids(
    processor
)
```

Collect vision token IDs from a processor / tokenizer / config.

Walks three sources to be robust across VLM families:

1. Known attribute names on the processor *and* its `config` (Gemma4,
   LLaVA put the IDs on the config rather than the processor).
2. A curated list of vision token strings looked up via
   `tokenizer.convert_tokens_to_ids`.
3. A keyword-based fuzzy scan of `tokenizer.added_tokens_decoder` so
   custom or future VLMs are picked up automatically.

```python
nemo_automodel.components.datasets.vlm.fake_image._scan_attrs(
    source,
    attr_names
)
```

Yield integer token IDs found via `getattr` on *source*.

```python
nemo_automodel.components.datasets.vlm.fake_image._warn_no_vision_tokens(
    processor
) -> None
```

Log a one-time warning when a processor exposes no recognizable vision tokens.

Without this warning a fake-image injection silently leaves the vision
tokens visible to attention, which can degrade training quality without
any other observable symptom.

```python
nemo_automodel.components.datasets.vlm.fake_image.inject_fake_image_into_conversation(
    conversation
)
```

Inject a fake image into a single conversation's first user message.

Returns a deep-copied conversation so the original is never mutated.

```python
nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_batch(
    batch,
    processor,
    sample_indices
)
```

Mask vision tokens in specified batch samples (2D tensors).

Sets `attention_mask = 0` for every vision token in the given
*sample\_indices* of the batch.

```python
nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_single(
    sample_dict,
    processor
)
```

Mask vision tokens in a single pre-tokenized sample (1D tensors).

Sets `attention_mask = 0` for every vision token in *sample\_dict*.
This is used at `__getitem__` time for pre-tokenized datasets.

```python
nemo_automodel.components.datasets.vlm.fake_image._FAKE_IMAGE = PILImage.new('RGB', (56, 56), (255, 255, 255))
```

```python
nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_ID_ATTRS = ('image_token_id', 'video_token_id', 'image_token_index', 'video_token_index', '...
```

```python
nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_KEYWORDS = ('image', 'video', 'media', 'vision', 'img_pad', 'vid_pad')
```

```python
nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_STRINGS = ('<|vision_start|>', '<|vision_end|>', '<|image_pad|>', '<|video_pad|>', '<|medi...
```

```python
nemo_automodel.components.datasets.vlm.fake_image._warned_unknown_processors: set[str] = set()
```

```python
nemo_automodel.components.datasets.vlm.fake_image.logger = logging.getLogger(__name__)
```