bridge.data.vlm_processing#
Shared VLM processing helpers for Energon and HF dataset paths.
Module Contents#
Classes#
NormalizedVLMSample | Source-normalized VLM sample consumed by shared processing.
Functions#
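normalize_energon_vlm_sample | Normalize an Energon ``ChatMLSample`` into the shared VLM sample contract.
normalize_hf_vlm_example | Normalize a HF-style VLM dataset example into the shared sample contract.
normalized_vlm_sample_to_hf_example | Convert a normalized VLM sample into the HF-style collate example schema.
get_processor_tokenizer | Return the tokenizer attached to a processor, or the object itself.
find_token_span | Find the first ``[start, end)`` token span matching ``pattern``.
gather_assistant_text_segments | Extract assistant text segments from a structured VLM conversation.
tokenize_text_without_special_tokens | Tokenize text using a HF-like tokenizer without adding special tokens.
build_assistant_loss_mask | Build an unshifted assistant-only loss mask using the current token-search behavior.
build_shifted_labels_and_loss_mask | Build next-token labels and shifted loss mask for Megatron training.
apply_assistant_labels_to_batch | Attach ``labels`` and ``loss_mask`` to a collated HF VLM batch.
ensure_position_ids | Ensure a collated batch has 2D position IDs.
pop_generic_visual_inputs | Move processor visual tensor keys into ``GenericVisualInputs``.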
API#
- class bridge.data.vlm_processing.NormalizedVLMSample#
Source-normalized VLM sample consumed by shared processing.
Expected input format: Instances are produced by source adapters such as ``normalize_energon_vlm_sample`` and ``normalize_hf_vlm_example``. ``conversation`` must be a structured list of chat turns in HF processor format, for example:

    [
        {"role": "user", "content": "describe <image>"},
        {"role": "assistant", "content": "a red car"},
    ]

``content`` may also be a multimodal content list accepted by ``processor.apply_chat_template``, for example:

    [{"type": "image", "image": image_obj}, {"type": "text", "text": "describe"}]

``images`` and ``videos`` are optional processor-ready modality payloads. Energon adapters convert WDS tensors to PIL objects before populating these fields; HF adapters may leave them ``None`` when media already lives inline in ``conversation`` content.

Output format: Shared processing treats this as the single boundary contract before model-specific tokenization and vision preprocessing. It does not contain batched tensors; model-specific collators convert it into model input tensors.
- conversation: list[dict[str, Any]]#
None
- images: list[Any] | None#
None
- videos: list[Any] | None#
None
- audio: Any | None#
None
- bridge.data.vlm_processing.normalize_energon_vlm_sample(
- sample: Any,
Normalize an Energon ``ChatMLSample`` into the shared VLM sample contract.
Expected input format: ``sample`` is expected to expose the Energon ``ChatMLSample`` fields:
- ``conversation``: JSON string accepted by ``cook_chatml_sample``. The JSON may use either ``{"role": ..., "content": ...}`` turns or ``{"from": ..., "value": ...}`` turns.
- ``imgs``: optional WDS decoded image tensor/list payload.
- ``videos``: optional WDS decoded video tensor/list payload.
- ``audio``: optional audio payload, passed through unchanged.
Output format: Returns ``NormalizedVLMSample`` where ``conversation`` is a list of ``{"role": str, "content": str | list[dict]}`` turns, ``images`` are PIL/list processor inputs or ``None``, ``videos`` are nested PIL/list processor inputs or ``None``, and ``audio`` is copied from the source sample when present.
- bridge.data.vlm_processing.normalize_hf_vlm_example(
- example: collections.abc.Mapping[str, Any],
Normalize a HF-style VLM dataset example into the shared sample contract.
Expected input format: ``example`` must contain ``"conversation"`` as a structured list of chat turns already produced by an HF dataset maker, for example:

    {
        "conversation": [
            {"role": "user", "content": [{"type": "image", "image": img}, {"type": "text", "text": "Q"}]},
            {"role": "assistant", "content": [{"type": "text", "text": "A"}]},
        ],
        "audio": optional_audio,
    }

Optional top-level ``images``/``image`` and ``videos``/``video`` fields are accepted for maker variants that do not embed media inline in the conversation.

Output format: Returns ``NormalizedVLMSample`` with a deep-copied structured ``conversation`` list, optional list-valued ``images`` and ``videos`` payloads, and optional ``audio``. The adapter does not call ``cook_chatml_sample`` because HF makers have already normalized the chat schema.
- Raises:
ValueError – If ``example["conversation"]`` is missing or is not a list.
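The contract described above can be illustrated with a minimal sketch. This is not the module's implementation: ``SketchSample`` and ``sketch_normalize_hf_example`` are hypothetical names, and the precedence of ``images`` over ``image`` (and ``videos`` over ``video``) is an assumption made only for illustration.

```python
from __future__ import annotations

import copy
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchSample:
    """Hypothetical stand-in for NormalizedVLMSample."""

    conversation: list[dict[str, Any]]
    images: list[Any] | None = None
    videos: list[Any] | None = None
    audio: Any | None = None


def _as_list(value: Any) -> list[Any] | None:
    """Wrap a single media object in a list; keep None as None."""
    if value is None:
        return None
    return list(value) if isinstance(value, (list, tuple)) else [value]


def sketch_normalize_hf_example(example: Mapping[str, Any]) -> SketchSample:
    conversation = example.get("conversation")
    if not isinstance(conversation, list):
        raise ValueError("example['conversation'] must be a list of chat turns")
    # Assumption: the plural key wins when both spellings are present.
    images = example.get("images", example.get("image"))
    videos = example.get("videos", example.get("video"))
    return SketchSample(
        conversation=copy.deepcopy(conversation),  # caller's data stays untouched
        images=_as_list(images),
        videos=_as_list(videos),
        audio=example.get("audio"),
    )
```

Deep-copying the conversation means later in-place edits by a collator do not leak back into the dataset example.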
- bridge.data.vlm_processing.normalized_vlm_sample_to_hf_example(
- sample: bridge.data.vlm_processing.NormalizedVLMSample,
- *,
- media_first: bool = False,
Convert a normalized VLM sample into the HF-style collate example schema.
Expected input format: ``sample`` follows ``NormalizedVLMSample``: ``conversation`` is a list of chat turns, and optional ``images``/``videos`` contain processor-ready media payloads such as PIL images or decoded video frame lists. Text turns may contain literal ``<image>``/``<video>`` placeholders.

Output format: Returns a dictionary suitable for VLM HF collate functions:

    {
        "conversation": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_obj},
                    {"type": "text", "text": "describe"},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": "answer"}]},
        ],
        "images": [image_obj],  # present when sample.images is not None
        "videos": [video_obj],  # present when sample.videos is not None
        "audio": audio_obj,     # present when sample.audio is not None
    }

Inline media parts are populated from ``sample.images`` and ``sample.videos`` in placeholder order. When ``media_first=True``, media parts are moved before text parts within each turn to preserve Qwen Energon's legacy media-before-text ordering while still using the shared HF collate function.
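To make the placeholder-order and ``media_first`` behavior concrete, here is a small sketch for a single text turn. ``sketch_expand_turn`` is a hypothetical helper, not part of the module; stripping whitespace around text pieces and dropping empty pieces are assumptions of the sketch.

```python
import re
from collections.abc import Iterator
from typing import Any

_PLACEHOLDER = re.compile(r"(<image>|<video>)")


def sketch_expand_turn(
    text: str,
    images: Iterator[Any],
    videos: Iterator[Any],
    media_first: bool = False,
) -> list[dict[str, Any]]:
    """Split text on <image>/<video> and fill media parts in placeholder order."""
    parts: list[dict[str, Any]] = []
    # The capturing group makes re.split keep the placeholders as pieces.
    for piece in _PLACEHOLDER.split(text):
        if piece == "<image>":
            parts.append({"type": "image", "image": next(images)})
        elif piece == "<video>":
            parts.append({"type": "video", "video": next(videos)})
        elif piece.strip():
            parts.append({"type": "text", "text": piece.strip()})
    if media_first:
        # Stable reordering: all media parts first, then all text parts.
        media = [p for p in parts if p["type"] != "text"]
        texts = [p for p in parts if p["type"] == "text"]
        parts = media + texts
    return parts
```

Passing iterators (for example ``iter(sample_images)``) lets several turns consume media from the same pool in order.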
- bridge.data.vlm_processing.get_processor_tokenizer(processor: Any) Any#
Return the tokenizer attached to a processor, or the object itself.
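One way to read this description is as an attribute lookup with a fallback. The sketch below assumes the tokenizer lives on a ``tokenizer`` attribute, as it does on typical HF processors; ``sketch_get_tokenizer`` and the two toy classes are hypothetical.

```python
from typing import Any


class ToyTokenizer:
    """Toy stand-in for a tokenizer object."""


class ToyProcessor:
    """Toy stand-in for a processor that wraps a tokenizer."""

    def __init__(self) -> None:
        self.tokenizer = ToyTokenizer()


def sketch_get_tokenizer(processor: Any) -> Any:
    # Fall back to the object itself when it has no `tokenizer` attribute,
    # so callers may pass either a processor or a bare tokenizer.
    return getattr(processor, "tokenizer", processor)
```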
- bridge.data.vlm_processing.find_token_span(
- sequence: collections.abc.Sequence[int] | torch.Tensor,
- pattern: collections.abc.Sequence[int],
- start: int = 0,
Find the first ``[start, end)`` token span matching ``pattern``.
- Parameters:
sequence – Token id sequence to search.
pattern – Token id pattern to locate.
start – Index to begin searching from.
- Returns:
``(start, end)`` for the first match, or ``(-1, -1)`` when no match exists.
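The search described here can be sketched as a naive sliding-window comparison over plain Python sequences. ``sketch_find_token_span`` is a hypothetical name; returning ``(-1, -1)`` for an empty pattern is an assumption, and a ``torch.Tensor`` input would first need ``.tolist()``.

```python
from collections.abc import Sequence


def sketch_find_token_span(
    sequence: Sequence[int],
    pattern: Sequence[int],
    start: int = 0,
) -> tuple[int, int]:
    """Return the half-open [start, end) span of the first match, else (-1, -1)."""
    seq = list(sequence)
    pat = list(pattern)
    if not pat:
        return (-1, -1)  # assumption: an empty pattern never matches
    for i in range(max(start, 0), len(seq) - len(pat) + 1):
        if seq[i : i + len(pat)] == pat:
            return (i, i + len(pat))
    return (-1, -1)
```

For example, searching ``[5, 1, 2, 3, 1, 2]`` for ``[1, 2]`` yields ``(1, 3)``, and repeating the search with ``start=2`` yields ``(4, 6)``.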
- bridge.data.vlm_processing.gather_assistant_text_segments(
- example_or_conversation: collections.abc.Mapping[str, Any] | collections.abc.Sequence[collections.abc.Mapping[str, Any]],
Extract assistant text segments from a structured VLM conversation.
- bridge.data.vlm_processing.tokenize_text_without_special_tokens(
- tokenizer: Any,
- text: str,
Tokenize text using a HF-like tokenizer without adding special tokens.
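With a Hugging Face tokenizer, this usually amounts to passing ``add_special_tokens=False`` to ``encode``. The sketch below shows the effect with a toy tokenizer; ``ToyTokenizer`` and ``sketch_tokenize_plain`` are hypothetical, and the toy vocabulary (one id per whitespace-separated word) is purely illustrative.

```python
class ToyTokenizer:
    """Toy stand-in that mimics encode(text, add_special_tokens=...)."""

    bos_id = 1
    eos_id = 2

    def encode(self, text: str, add_special_tokens: bool = True) -> list[int]:
        # Toy vocabulary: the id of each word is 100 plus its length.
        ids = [100 + len(word) for word in text.split()]
        if add_special_tokens:
            return [self.bos_id, *ids, self.eos_id]
        return ids


def sketch_tokenize_plain(tokenizer: ToyTokenizer, text: str) -> list[int]:
    # No BOS/EOS, so the ids can be searched for inside a longer sequence.
    return list(tokenizer.encode(text, add_special_tokens=False))
```

Leaving out special tokens matters for span search: a leading BOS id would prevent the assistant text from matching in the middle of ``input_ids``.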
- bridge.data.vlm_processing._assistant_text_variants(
- text: str,
- *,
- include_search_variants: bool,
- bridge.data.vlm_processing.build_assistant_loss_mask(
- example_or_conversation: collections.abc.Mapping[str, Any] | collections.abc.Sequence[collections.abc.Mapping[str, Any]],
- input_ids: collections.abc.Sequence[int] | torch.Tensor,
- processor: Any,
- skipped_tokens: torch.Tensor | None = None,
- *,
- include_search_variants: bool = True,
- require_matches: bool = False,
- warn_on_all_masked: bool = True,
Build an unshifted assistant-only loss mask using the current token-search behavior.
This intentionally preserves the existing text-search masking strategy. Step 3 of issue #4041 can replace this helper with a generation-tag or boundary-token implementation without touching every VLM collate/task encoder again.
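The text-search strategy can be sketched as follows: tokenize each assistant segment, locate it in ``input_ids`` from left to right, and mark the matched positions. This is a simplified illustration on plain lists, not the module's code; ``sketch_assistant_loss_mask`` is a hypothetical name, and ``skipped_tokens``, search variants, and ``require_matches`` are not modeled.

```python
def sketch_assistant_loss_mask(
    input_ids: list[int],
    assistant_token_segments: list[list[int]],
) -> list[int]:
    """Return a 0/1 mask with 1 on tokens that belong to assistant segments."""
    mask = [0] * len(input_ids)
    cursor = 0  # resume after the previous match so repeated text maps in order
    for pattern in assistant_token_segments:
        n = len(pattern)
        if n == 0:
            continue
        hit = -1
        for i in range(cursor, len(input_ids) - n + 1):
            if input_ids[i : i + n] == pattern:
                hit = i
                break
        if hit < 0:
            continue  # unmatched segments stay masked in this sketch
        for j in range(hit, hit + n):
            mask[j] = 1
        cursor = hit + n
    return mask
```

The mask is unshifted: position ``t`` is 1 when token ``t`` itself is assistant text, not when it predicts one.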
- bridge.data.vlm_processing.build_shifted_labels_and_loss_mask(
- input_ids: torch.Tensor,
- assistant_loss_mask: torch.Tensor,
- skipped_tokens: torch.Tensor | None = None,
- *,
- ignore_index: int = IGNORE_INDEX,
Build next-token labels and shifted loss mask for Megatron training.
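Next-token shifting can be illustrated on plain lists: the label at position ``t`` is the token at ``t + 1``, and the mask shifts the same way. ``sketch_shift`` is a hypothetical name; the value ``-100`` for ``IGNORE_INDEX`` is an assumption (it is the common PyTorch default), and ``skipped_tokens`` is not modeled.

```python
IGNORE_INDEX = -100  # assumption: the module's actual constant is not shown here


def sketch_shift(
    input_ids: list[int],
    assistant_loss_mask: list[int],
    ignore_index: int = IGNORE_INDEX,
) -> tuple[list[int], list[int]]:
    """Shift ids and mask left by one; the final position has no target."""
    labels = input_ids[1:] + [ignore_index]
    loss_mask = assistant_loss_mask[1:] + [0]
    # Positions that do not contribute to the loss get ignore_index as label.
    labels = [tok if keep else ignore_index for tok, keep in zip(labels, loss_mask)]
    return labels, loss_mask
```

With ``input_ids = [1, 10, 20, 21, 3]`` and an unshifted mask ``[0, 0, 1, 1, 0]``, the sketch returns labels ``[-100, 20, 21, -100, -100]`` and loss mask ``[0, 1, 1, 0, 0]``.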
- bridge.data.vlm_processing.apply_assistant_labels_to_batch(
- batch: collections.abc.MutableMapping[str, Any],
- examples: collections.abc.Sequence[collections.abc.Mapping[str, Any]],
- processor: Any,
- skipped_tokens: torch.Tensor,
- *,
- unmask_last_token: bool = False,
Attach ``labels`` and ``loss_mask`` to a collated HF VLM batch.
- bridge.data.vlm_processing.ensure_position_ids(
- batch: collections.abc.MutableMapping[str, Any],
Ensure a collated batch has 2D position IDs.
- bridge.data.vlm_processing.pop_generic_visual_inputs(
- batch: collections.abc.MutableMapping[str, Any],
- visual_keys: collections.abc.Sequence[str],
Move processor visual tensor keys into ``GenericVisualInputs``.