bridge.data.energon.hf_encoder_task_encoder#
Generic HF-encoder VLM task encoder for Energon dataloading.
Works with any HF processor that handles tokenization + vision preprocessing in a single `processor()` call (e.g. Gemma3-VL, Ministral3, GLM-4.5V).
Module Contents#
Classes#
| Class | Description |
|---|---|
| `HFEncoderTaskSample` | Encoded sample for a generic HF-encoder VLM. |
| `HFEncoderTaskBatch` | Batched format for a generic HF-encoder VLM. |
| `HFEncoderVLMTaskEncoder` | Task encoder for HF-encoder VLMs that rely on `processor()` for tokenization + vision. |
API#
- class bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample#
Encoded sample for a generic HF-encoder VLM.
- __key__: str#
- __subflavors__: Dict#
- input_ids: torch.Tensor#
- labels: torch.Tensor#
- loss_mask: torch.Tensor#
- visual_tensors: Dict[str, torch.Tensor]#
- class bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskBatch#
Bases: megatron.energon.Batch

Batched format for a generic HF-encoder VLM.
- __keys__: List[str]#
- __subflavors__: List[Dict]#
- input_ids: torch.Tensor#
- labels: torch.Tensor#
- loss_mask: torch.Tensor#
- attention_mask: torch.Tensor#
- position_ids: torch.Tensor#
- visual_tensors: Dict[str, Optional[torch.Tensor]]#
- class bridge.data.energon.hf_encoder_task_encoder.HFEncoderVLMTaskEncoder(
- processor,
- seq_length: int = 4096,
- visual_keys: Sequence[str] = ('pixel_values',),
- min_pixels: Optional[int] = None,
- max_pixels: Optional[int] = None,
- )#
Bases: megatron.energon.DefaultTaskEncoder[megatron.bridge.data.energon.task_encoder_utils.ChatMLSample, bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample, bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskBatch, dict]

Task encoder for HF-encoder VLMs that rely on `processor()` for tokenization + vision.

- Parameters:
- processor – HF `AutoProcessor` instance. Must support `apply_chat_template` and `__call__(text=..., images=..., ...)` returning `input_ids` and visual tensor keys.
- seq_length – Maximum sequence length (tokens are truncated to this).
- visual_keys – Which keys from the processor output to capture as visual tensors (e.g. `("pixel_values",)` for Gemma3-VL / Ministral3, `("pixel_values", "pixel_values_videos", "image_grid_thw", "video_grid_thw")` for GLM-4.5V).
- min_pixels – Optional min pixel constraint forwarded to the processor.
- max_pixels – Optional max pixel constraint forwarded to the processor.
Initialization
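To illustrate how the `visual_keys` parameter could filter a processor's output down to the captured visual tensors, here is a minimal standalone sketch. The helper name `capture_visual_tensors` and the None-for-missing-keys behavior are assumptions for illustration, not part of the class API:

```python
from typing import Dict, Optional, Sequence, TypeVar

T = TypeVar("T")


def capture_visual_tensors(
    processor_output: Dict[str, T],
    visual_keys: Sequence[str] = ("pixel_values",),
) -> Dict[str, Optional[T]]:
    """Keep only the configured visual keys from a processor output dict.

    Keys absent from the output map to None, mirroring the
    Dict[str, Optional[torch.Tensor]] type on HFEncoderTaskBatch.
    """
    return {key: processor_output.get(key) for key in visual_keys}


# Example: a GLM-4.5V-style key set where only images were provided.
selected = capture_visual_tensors(
    {"input_ids": [0, 1, 2], "pixel_values": "tensor-placeholder"},
    visual_keys=("pixel_values", "pixel_values_videos"),
)
```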
- property _tokenizer#
Return the underlying tokenizer from the processor.
- property _pad_token_id: int#
- property _eos_token_id: int#
- property _image_token_id: Optional[int]#
Resolve the image token ID from the processor, or `None` if unavailable.
- static _find_contiguous_blocks(
- arr: numpy.ndarray,
- value: int,
- )#
Return `[(start, end), ...]` for each contiguous run of `value` in `arr`.
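A contiguous-run finder like this can be sketched with a vectorized NumPy diff over a padded boolean mask. This is a hypothetical standalone version, and the end-exclusive span convention is an assumption about the method's contract:

```python
import numpy as np


def find_contiguous_blocks(arr: np.ndarray, value: int):
    """Return [(start, end), ...] for each contiguous run of `value` in `arr`.

    `end` is treated as exclusive here (an assumption). Pads the equality
    mask with False on both sides so runs touching either boundary are
    still detected by the diff.
    """
    mask = np.concatenate(([False], arr == value, [False])).astype(np.int8)
    diff = np.diff(mask)
    starts = np.flatnonzero(diff == 1)   # False -> True transitions
    ends = np.flatnonzero(diff == -1)    # True -> False transitions
    return list(zip(starts.tolist(), ends.tolist()))
```

In a task encoder this pattern is handy for locating runs of a special token (e.g. an image placeholder) inside `input_ids`.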
- encode_sample(
- sample: megatron.bridge.data.energon.task_encoder_utils.ChatMLSample,
- )#
Encode a single ChatML sample into model-ready tensors:
- Convert WDS tensor images/videos to PIL.
- Normalize the conversation via `cook_chatml_sample`.
- Use the HF processor's `apply_chat_template` to get the prompt text, then call `processor(text=..., images=...)` for joint tokenization + vision preprocessing.
- Build a loss mask that supervises only assistant turns.
- Truncate to `seq_length`.
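The assistant-only loss mask step can be sketched as zeroing out everything except the assistant spans. The helper name and the end-exclusive `(start, end)` convention below are illustrative assumptions, not the actual implementation:

```python
import numpy as np


def build_assistant_loss_mask(num_tokens: int, assistant_spans):
    """Hypothetical sketch: 1.0 on assistant-turn tokens, 0.0 elsewhere.

    assistant_spans is a list of (start, end) token-index pairs with
    `end` exclusive (an assumption about the span convention).
    """
    loss_mask = np.zeros(num_tokens, dtype=np.float32)
    for start, end in assistant_spans:
        loss_mask[start:end] = 1.0
    return loss_mask
```

Tokens outside assistant turns (system prompt, user turns, image placeholders) then contribute nothing to the loss.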
- batch(
- samples: List[bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample],
- )#
Pad and collate a list of encoded samples into a batch.
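The pad-and-collate step amounts to right-padding variable-length sequences to the batch maximum and stacking them. A minimal sketch, assuming right-padding and a caller-supplied pad value (the actual method also builds `attention_mask` and `position_ids` and collates visual tensors):

```python
import numpy as np


def pad_and_stack(seqs, pad_value=0):
    """Hypothetical right-padding collate for 1-D integer sequences."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_value, dtype=np.int64)
    for i, seq in enumerate(seqs):
        out[i, : len(seq)] = seq  # copy tokens; the tail stays pad_value
    return out
```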
- encode_batch(...) → dict#
Convert the batch dataclass to a dict, wrapping visual tensors in `GenericVisualInputs`.