nemo_automodel.components.datasets.vlm.fake_image

Fake image injection helpers for FSDP / DeepSpeed Zero3.

When a batch contains no images or videos, the visual encoder is not called during the model forward pass. In FSDP / DeepSpeed Zero3 every parameter must participate in the collective all-gather / reduce-scatter; skipping the visual encoder causes the training to hang.

The fix mirrors LLaMA-Factory’s approach: inject a tiny (56x56) white image into pure-text samples. The corresponding vision tokens get attention_mask = 0 so they are invisible to attention and labels = -100 (automatic, because the fake image lives in a user message, never an assistant turn). This guarantees model correctness while keeping the visual encoder active.

Module Contents

Functions

Name	Description
`_batch_has_media`	Return True if any conversation in conversations contains an image or video.
`_conversation_has_media`	Return True if conversation (a single list of messages) contains an image or video.
`_get_vision_token_ids`	Collect vision token IDs from a processor / tokenizer / config.
`_scan_attrs`	Yield integer token IDs found via `getattr` on source.
`_warn_no_vision_tokens`	Log a one-time warning when a processor exposes no recognizable vision tokens.
`inject_fake_image_into_conversation`	Inject a fake image into a single conversation’s first user message.
`mask_fake_vision_tokens_batch`	Mask vision tokens in specified batch samples (2D tensors).
`mask_fake_vision_tokens_single`	Mask vision tokens in a single pre-tokenized sample (1D tensors).

Data

_FAKE_IMAGE

_VISION_TOKEN_ID_ATTRS

_VISION_TOKEN_KEYWORDS

_VISION_TOKEN_STRINGS

_warned_unknown_processors

logger

API

nemo_automodel.components.datasets.vlm.fake_image._batch_has_media(
    conversations
)

Return True if any conversation in conversations contains an image or video.

nemo_automodel.components.datasets.vlm.fake_image._conversation_has_media(
    conversation
)

Return True if conversation (a single list of messages) contains an image or video.

nemo_automodel.components.datasets.vlm.fake_image._get_vision_token_ids(
    processor
)

Collect vision token IDs from a processor / tokenizer / config.

Walks three sources to be robust across VLM families:

Known attribute names on the processor and its config (Gemma4, LLaVA put the IDs on the config rather than the processor).
A curated list of vision token strings looked up via tokenizer.convert_tokens_to_ids.
A keyword-based fuzzy scan of tokenizer.added_tokens_decoder so custom or future VLMs are picked up automatically.

nemo_automodel.components.datasets.vlm.fake_image._scan_attrs(
    source,
    attr_names
)

Yield integer token IDs found via getattr on source.

nemo_automodel.components.datasets.vlm.fake_image._warn_no_vision_tokens(
    processor
) -> None

Log a one-time warning when a processor exposes no recognizable vision tokens.

Without this warning a fake-image injection silently leaves the vision tokens visible to attention, which can degrade training quality without any other observable symptom.

nemo_automodel.components.datasets.vlm.fake_image.inject_fake_image_into_conversation(
    conversation
)

Inject a fake image into a single conversation’s first user message.

Returns a deep-copied conversation so the original is never mutated.

nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_batch(
    batch,
    processor,
    sample_indices
)

Mask vision tokens in specified batch samples (2D tensors).

Sets attention_mask = 0 for every vision token in the given sample_indices of the batch.

nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_single(
    sample_dict,
    processor
)

Mask vision tokens in a single pre-tokenized sample (1D tensors).

Sets attention_mask = 0 for every vision token in sample_dict. This is used at __getitem__ time for pre-tokenized datasets.

nemo_automodel.components.datasets.vlm.fake_image._FAKE_IMAGE = PILImage.new('RGB', (56, 56), (255, 255, 255))

nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_ID_ATTRS = ('image_token_id', 'video_token_id', 'image_token_index', 'video_token_index', '...

nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_KEYWORDS = ('image', 'video', 'media', 'vision', 'img_pad', 'vid_pad')

nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_STRINGS = ('<|vision_start|>', '<|vision_end|>', '<|image_pad|>', '<|video_pad|>', '<|medi...

nemo_automodel.components.datasets.vlm.fake_image._warned_unknown_processors: set[str] = set()

nemo_automodel.components.datasets.vlm.fake_image.logger = logging.getLogger(__name__)