nemo_automodel.components.datasets.vlm.fake_image#
Fake image injection helpers for FSDP / DeepSpeed Zero3.
When a batch contains no images or videos, the visual encoder is not called during the model forward pass. In FSDP / DeepSpeed Zero3 every parameter must participate in the collective all-gather / reduce-scatter; skipping the visual encoder causes the training to hang.
The fix mirrors LLaMA-Factory’s approach: inject a tiny (56x56) white image
into pure-text samples. The corresponding vision tokens get
attention_mask = 0 so they are invisible to attention and
labels = -100 (automatic, because the fake image lives in a user
message, never an assistant turn). This guarantees model correctness while
keeping the visual encoder active.
Module Contents#
Functions#
Return True if conversation (a single list of messages) contains an image or video. |
|
Return True if any conversation in conversations contains an image or video. |
|
Inject a fake image into a single conversation’s first user message. |
|
Yield integer token IDs found via |
|
Collect vision token IDs from a processor / tokenizer / config. |
|
Log a one-time warning when a processor exposes no recognizable vision tokens. |
|
Mask vision tokens in a single pre-tokenized sample (1D tensors). |
|
Mask vision tokens in specified batch samples (2D tensors). |
Data#
API#
- nemo_automodel.components.datasets.vlm.fake_image.logger#
‘getLogger(…)’
- nemo_automodel.components.datasets.vlm.fake_image._FAKE_IMAGE#
‘new(…)’
- nemo_automodel.components.datasets.vlm.fake_image._conversation_has_media(conversation)#
Return True if conversation (a single list of messages) contains an image or video.
- nemo_automodel.components.datasets.vlm.fake_image._batch_has_media(conversations)#
Return True if any conversation in conversations contains an image or video.
- nemo_automodel.components.datasets.vlm.fake_image.inject_fake_image_into_conversation(conversation)#
Inject a fake image into a single conversation’s first user message.
Returns a deep-copied conversation so the original is never mutated.
- nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_ID_ATTRS#
(‘image_token_id’, ‘video_token_id’, ‘image_token_index’, ‘video_token_index’, ‘media_placeholder_to…
- nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_STRINGS#
(‘<|vision_start|>’, ‘<|vision_end|>’, ‘<|image_pad|>’, ‘<|video_pad|>’, ‘<|media_start|>’, ‘<|media…
- nemo_automodel.components.datasets.vlm.fake_image._VISION_TOKEN_KEYWORDS#
(‘image’, ‘video’, ‘media’, ‘vision’, ‘img_pad’, ‘vid_pad’)
- nemo_automodel.components.datasets.vlm.fake_image._scan_attrs(source, attr_names)#
Yield integer token IDs found via
getattron source.
- nemo_automodel.components.datasets.vlm.fake_image._get_vision_token_ids(processor)#
Collect vision token IDs from a processor / tokenizer / config.
Walks three sources to be robust across VLM families:
Known attribute names on the processor and its
config(Gemma4, LLaVA put the IDs on the config rather than the processor).A curated list of vision token strings looked up via
tokenizer.convert_tokens_to_ids.A keyword-based fuzzy scan of
tokenizer.added_tokens_decoderso custom or future VLMs are picked up automatically.
- nemo_automodel.components.datasets.vlm.fake_image._warned_unknown_processors: set[str]#
‘set(…)’
- nemo_automodel.components.datasets.vlm.fake_image._warn_no_vision_tokens(processor) None#
Log a one-time warning when a processor exposes no recognizable vision tokens.
Without this warning a fake-image injection silently leaves the vision tokens visible to attention, which can degrade training quality without any other observable symptom.
- nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_single(sample_dict, processor)#
Mask vision tokens in a single pre-tokenized sample (1D tensors).
Sets
attention_mask = 0for every vision token in sample_dict. This is used at__getitem__time for pre-tokenized datasets.
- nemo_automodel.components.datasets.vlm.fake_image.mask_fake_vision_tokens_batch(batch, processor, sample_indices)#
Mask vision tokens in specified batch samples (2D tensors).
Sets
attention_mask = 0for every vision token in the given sample_indices of the batch.