nemo_automodel.components.datasets.vlm.fake_image#

Fake image injection helpers for FSDP / DeepSpeed ZeRO-3.

When a batch contains no images or videos, the visual encoder is not called during the model forward pass. In FSDP / DeepSpeed ZeRO-3 every parameter must participate in the collective all-gather / reduce-scatter; skipping the visual encoder leaves its parameters out of those collectives and causes training to hang.

The fix mirrors LLaMA-Factory’s approach: inject a tiny (56x56) white image into pure-text samples. The corresponding vision tokens get attention_mask = 0 so they are invisible to attention and labels = -100 (automatic, because the fake image lives in a user message, never an assistant turn). This guarantees model correctness while keeping the visual encoder active.
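The injection step described above can be sketched as follows. The message schema (`{"role", "content": [{"type", ...}]}`) and the `_FAKE_IMAGE` placeholder are assumptions for illustration; the real module presumably stores a 56x56 all-white PIL image.

```python
from copy import deepcopy

# Hypothetical stand-in for the module's _FAKE_IMAGE constant; in the real
# module this is presumably a 56x56 all-white PIL image.
_FAKE_IMAGE = {"type": "image", "image": "fake_white_56x56"}


def inject_fake_image_into_conversation(conversation):
    """Prepend a fake image to the first user message (sketch).

    Assumes the HF-style message schema:
    [{"role": ..., "content": [{"type": ..., ...}, ...]}, ...]
    """
    conversation = deepcopy(conversation)  # the caller's conversation is never mutated
    for message in conversation:
        if message.get("role") == "user":
            message["content"] = [_FAKE_IMAGE] + list(message["content"])
            break
    return conversation


convo = [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}]
patched = inject_fake_image_into_conversation(convo)
```

Because the fake image lands in a user turn, the chat template assigns its tokens labels = -100 automatically; only the attention mask needs explicit handling afterwards.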

Module Contents#

Functions#

_conversation_has_media

Return True if conversation (a single list of messages) contains an image or video.

_batch_has_media

Return True if any conversation in conversations contains an image or video.

inject_fake_image_into_conversation

Inject a fake image into a single conversation’s first user message.

_get_vision_token_ids

Collect vision token IDs from a processor/tokenizer.

mask_fake_vision_tokens_single

Mask vision tokens in a single pre-tokenized sample (1D tensors).

mask_fake_vision_tokens_batch

Mask vision tokens in specified batch samples (2D tensors).

Data#

API#

nemo_automodel.components.datasets.vlm.fake_image._FAKE_IMAGE#

‘new(…)’

nemo_automodel.components.datasets.vlm.fake_image._conversation_has_media(conversation)#

Return True if conversation (a single list of messages) contains an image or video.
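A minimal sketch of this media check, assuming the HF-style message schema in which `content` is a list of typed parts (the real helper may also handle other content layouts):

```python
def conversation_has_media(conversation):
    """Return True if any content part of any message is an image or video."""
    for message in conversation:
        content = message.get("content", [])
        if not isinstance(content, list):
            continue  # plain-string content carries no media
        for part in content:
            if part.get("type") in ("image", "video"):
                return True
    return False


text_only = [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]
with_image = [{"role": "user", "content": [{"type": "image", "image": "x.png"}]}]
```

`_batch_has_media` is then just `any(...)` of this predicate over the batch's conversations.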

nemo_automodel.components.datasets.vlm.fake_image._batch_has_media(conversations)#

Return True if any conversation in conversations contains an image or video.

nemo_automodel.components.datasets.vlm.fake_image.inject_fake_image_into_conversation(conversation)#

Inject a fake image into a single conversation’s first user message.

Returns a deep-copied conversation so the original is never mutated.

nemo_automodel.components.datasets.vlm.fake_image._get_vision_token_ids(processor)#

Collect vision token IDs from a processor/tokenizer.
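A sketch of how such a collection step might work, assuming a HF-style processor that exposes a tokenizer; the candidate token strings are illustrative guesses, not the module's exact list, and the stub tokenizer exists only to make the example self-contained:

```python
class _StubTokenizer:
    """Minimal stand-in for a HF tokenizer (for illustration only)."""

    vocab = {"<image>": 151655, "<video>": 151656}
    unk_token_id = 0

    def convert_tokens_to_ids(self, token):
        return self.vocab.get(token, self.unk_token_id)


def get_vision_token_ids(processor_or_tokenizer):
    """Probe the tokenizer for common vision placeholder tokens (sketch)."""
    tok = getattr(processor_or_tokenizer, "tokenizer", processor_or_tokenizer)
    candidates = ("<image>", "<video>", "<|image_pad|>", "<|video_pad|>")
    ids = set()
    for token in candidates:
        token_id = tok.convert_tokens_to_ids(token)
        # Skip tokens the tokenizer does not actually know.
        if token_id is not None and token_id != tok.unk_token_id:
            ids.add(token_id)
    return ids


vision_ids = get_vision_token_ids(_StubTokenizer())
```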

nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_single(sample_dict, processor)#

Mask vision tokens in a single pre-tokenized sample (1D tensors).

Sets attention_mask = 0 for every vision token in sample_dict. This is used at __getitem__ time for pre-tokenized datasets.
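The masking itself reduces to zeroing the attention mask wherever `input_ids` holds a vision token. In this sketch plain lists stand in for the real 1D tensors, and the vision token IDs are passed directly rather than derived from the processor:

```python
def mask_fake_vision_tokens_single(sample, vision_token_ids):
    """Zero the attention mask at every vision-token position (sketch)."""
    sample["attention_mask"] = [
        0 if tok in vision_token_ids else mask
        for tok, mask in zip(sample["input_ids"], sample["attention_mask"])
    ]
    return sample


# Token id 9 plays the role of a vision placeholder token here.
sample = {"input_ids": [101, 9, 9, 42], "attention_mask": [1, 1, 1, 1]}
masked = mask_fake_vision_tokens_single(sample, vision_token_ids={9})
```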

nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_batch(batch, processor, sample_indices)#

Mask vision tokens in specified batch samples (2D tensors).

Sets attention_mask = 0 for every vision token in the given sample_indices of the batch.
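The batch variant applies the same per-position zeroing, but only to the rows named in sample_indices, i.e. the samples that received a fake image. As above, plain nested lists stand in for 2D tensors and the vision token IDs are passed in directly:

```python
def mask_fake_vision_tokens_batch(batch, vision_token_ids, sample_indices):
    """Zero attention_mask at vision-token positions for selected rows (sketch)."""
    for i in sample_indices:
        batch["attention_mask"][i] = [
            0 if tok in vision_token_ids else mask
            for tok, mask in zip(batch["input_ids"][i], batch["attention_mask"][i])
        ]
    return batch


# Row 1 got a fake image; row 0 has real media and is left untouched.
batch = {
    "input_ids": [[9, 9, 5], [9, 9, 6]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
masked = mask_fake_vision_tokens_batch(batch, {9}, sample_indices=[1])
```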