nemo_automodel.components.datasets.vlm.fake_image
nemo_automodel.components.datasets.vlm.fake_image
Fake image injection helpers for FSDP / DeepSpeed Zero3.
When a batch contains no images or videos, the visual encoder is not called during the model forward pass. In FSDP / DeepSpeed Zero3 every parameter must participate in the collective all-gather / reduce-scatter; skipping the visual encoder causes the training to hang.
The fix mirrors LLaMA-Factory’s approach: inject a tiny (56x56) white image
into pure-text samples. The corresponding vision tokens get
attention_mask = 0 so they are invisible to attention and
labels = -100 (automatic, because the fake image lives in a user
message, never an assistant turn). This guarantees model correctness while
keeping the visual encoder active.
Module Contents
Functions
Data
API
Return True if any conversation in conversations contains an image or video.
Return True if conversation (a single list of messages) contains an image or video.
Collect vision token IDs from a processor / tokenizer / config.
Walks three sources to be robust across VLM families:
- Known attribute names on the processor and its
config(Gemma4, LLaVA put the IDs on the config rather than the processor). - A curated list of vision token strings looked up via
tokenizer.convert_tokens_to_ids. - A keyword-based fuzzy scan of
tokenizer.added_tokens_decoderso custom or future VLMs are picked up automatically.
Yield integer token IDs found via getattr on source.
Log a one-time warning when a processor exposes no recognizable vision tokens.
Without this warning a fake-image injection silently leaves the vision tokens visible to attention, which can degrade training quality without any other observable symptom.
Inject a fake image into a single conversation’s first user message.
Returns a deep-copied conversation so the original is never mutated.
Mask vision tokens in specified batch samples (2D tensors).
Sets attention_mask = 0 for every vision token in the given
sample_indices of the batch.
Mask vision tokens in a single pre-tokenized sample (1D tensors).
Sets attention_mask = 0 for every vision token in sample_dict.
This is used at __getitem__ time for pre-tokenized datasets.