bridge.data.energon.hf_encoder_task_encoder#

Generic HF-encoder VLM task encoder for Energon dataloading.

Works with any HF processor that handles tokenization + vision preprocessing in a single processor() call (e.g. Gemma3-VL, Ministral3, GLM-4.5V).
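
A minimal sketch of that single-call pattern, assuming a Gemma3-VL-style processor (the checkpoint id, message schema, and dummy image are placeholders):

    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")  # placeholder checkpoint
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ]}
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False)
    dummy_image = Image.new("RGB", (224, 224))
    out = processor(text=prompt, images=[dummy_image], return_tensors="pt")
    # `out` carries "input_ids" plus visual keys such as "pixel_values".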

Module Contents#

Classes#

HFEncoderTaskSample

Encoded sample for a generic HF-encoder VLM.

HFEncoderTaskBatch

Batched format for a generic HF-encoder VLM.

HFEncoderVLMTaskEncoder

Task encoder for HF-encoder VLMs that rely on processor() for tokenization + vision.

API#

class bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample#

Encoded sample for a generic HF-encoder VLM.

__key__: str#

None

__subflavors__: Dict#

None

input_ids: torch.Tensor#

None

labels: torch.Tensor#

None

loss_mask: torch.Tensor#

None

visual_tensors: Dict[str, torch.Tensor]#

‘field(…)’
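
For orientation, a hand-built sample could look like the following; real samples come from encode_sample, and the label/mask values here are illustrative rather than the encoder's actual conventions:

    import torch

    from megatron.bridge.data.energon.hf_encoder_task_encoder import HFEncoderTaskSample  # import path assumed

    sample = HFEncoderTaskSample(
        __key__="shard_0000/000017",
        __subflavors__={},
        input_ids=torch.tensor([2, 15, 87, 42, 1]),
        labels=torch.tensor([15, 87, 42, 1, -100]),
        loss_mask=torch.tensor([0.0, 0.0, 1.0, 1.0, 0.0]),  # illustrative: 1.0 on assistant tokens only
        visual_tensors={"pixel_values": torch.zeros(1, 3, 224, 224)},
    )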

class bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskBatch#

Bases: megatron.energon.Batch

Batched format for a generic HF-encoder VLM.

__keys__: List[str]#

‘field(…)’

__subflavors__: List[Dict]#

‘field(…)’

input_ids: torch.Tensor#

‘field(…)’

labels: torch.Tensor#

‘field(…)’

loss_mask: torch.Tensor#

‘field(…)’

attention_mask: torch.Tensor#

‘field(…)’

position_ids: torch.Tensor#

‘field(…)’

visual_tensors: Dict[str, Optional[torch.Tensor]]#

‘field(…)’

class bridge.data.energon.hf_encoder_task_encoder.HFEncoderVLMTaskEncoder(
processor,
seq_length: int = 4096,
visual_keys: Sequence[str] = ('pixel_values',),
min_pixels: Optional[int] = None,
max_pixels: Optional[int] = None,
)#

Bases: megatron.energon.DefaultTaskEncoder[megatron.bridge.data.energon.task_encoder_utils.ChatMLSample, bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample, bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskBatch, dict]

Task encoder for HF-encoder VLMs that rely on processor() for tokenization + vision.

Parameters:
  • processor – HF AutoProcessor instance. Must support apply_chat_template and __call__(text=..., images=..., ...) returning input_ids and visual tensor keys.

  • seq_length – Maximum sequence length; token sequences longer than this are truncated.

  • visual_keys – Which keys from the processor output to capture as visual tensors (e.g. ("pixel_values",) for Gemma3-VL / Ministral3, ("pixel_values", "pixel_values_videos", "image_grid_thw", "video_grid_thw") for GLM-4.5V).

  • min_pixels – Optional min pixel constraint forwarded to the processor.

  • max_pixels – Optional max pixel constraint forwarded to the processor.

Initialization
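
A hedged construction example; the checkpoint id is a placeholder and the import path is inferred from the Bases line above:

    from transformers import AutoProcessor

    from megatron.bridge.data.energon.hf_encoder_task_encoder import HFEncoderVLMTaskEncoder

    processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V")  # placeholder checkpoint
    encoder = HFEncoderVLMTaskEncoder(
        processor,
        seq_length=8192,
        visual_keys=(
            "pixel_values",
            "pixel_values_videos",
            "image_grid_thw",
            "video_grid_thw",
        ),
        max_pixels=1280 * 28 * 28,  # optional pixel budget forwarded to the processor
    )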

property _tokenizer#

Return the underlying tokenizer from the processor.

property _pad_token_id: int#

Return the pad token ID from the underlying tokenizer.

property _eos_token_id: int#

Return the EOS token ID from the underlying tokenizer.

property _image_token_id: Optional[int]#

Resolve image token ID from the processor, or None if unavailable.
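
The lookup is not spelled out here; one plausible sketch, assuming common HF processor attribute names (resolve_image_token_id and the image_token attribute are assumptions, not this class's code):

    def resolve_image_token_id(processor):
        # Hedged sketch: attribute names vary across HF processors.
        tokenizer = getattr(processor, "tokenizer", processor)
        image_token = getattr(processor, "image_token", None)  # e.g. "<image>"
        if image_token is None:
            return None
        token_id = tokenizer.convert_tokens_to_ids(image_token)
        return None if token_id == tokenizer.unk_token_id else token_id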

static _find_contiguous_blocks(
arr: numpy.ndarray,
value: int,
) → List[Tuple[int, int]]#

Return [(start, end), ...] for each contiguous run of value in arr.
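
A minimal reference implementation of the same idea, assuming half-open (start, end) ranges (the actual method's end convention is not documented here):

    import numpy as np
    from typing import List, Tuple

    def find_contiguous_blocks(arr: np.ndarray, value: int) -> List[Tuple[int, int]]:
        """Collect half-open (start, end) index pairs for each run of `value`."""
        blocks, start = [], None
        for i, v in enumerate(arr):
            if v == value and start is None:
                start = i
            elif v != value and start is not None:
                blocks.append((start, i))
                start = None
        if start is not None:
            blocks.append((start, len(arr)))
        return blocks

    find_contiguous_blocks(np.array([7, 7, 0, 7, 7, 7]), 7)  # [(0, 2), (3, 6)]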

encode_sample(
sample: megatron.bridge.data.energon.task_encoder_utils.ChatMLSample,
) → bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample#

Encode a single ChatML sample into model-ready tensors.

  1. Convert WDS tensor images/videos to PIL.

  2. Normalize conversation via cook_chatml_sample.

  3. Use the HF processor’s apply_chat_template to get the prompt text, then call processor(text=..., images=...) for joint tokenization + vision preprocessing.

  4. Build a loss mask that only supervises assistant turns.

  5. Truncate to seq_length.
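
Step 4 is the subtle part; a self-contained illustration of the masking idea (the (start, end, role) spans are a simplification; the real encoder derives them from the chat template):

    import torch

    def assistant_loss_mask(turn_spans, seq_len):
        """Supervise only assistant turns; turn_spans holds half-open
        (start, end, role) token ranges (a simplified, assumed shape)."""
        mask = torch.zeros(seq_len)
        for start, end, role in turn_spans:
            if role == "assistant":
                mask[start:end] = 1.0
        return mask

    # Tokens 5..11 form the assistant reply; everything else stays unsupervised.
    mask = assistant_loss_mask([(0, 5, "user"), (5, 12, "assistant")], seq_len=16)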

batch(
samples: List[bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskSample],
) → bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskBatch#

Pad and collate a list of encoded samples into a batch.
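
A minimal sketch of the collation that produces the HFEncoderTaskBatch fields, assuming right-padding (the pad value and the attention_mask/position_ids recipe are assumptions):

    import torch
    import torch.nn.functional as F

    def pad_and_stack(seqs, pad_value):
        """Right-pad 1-D tensors to a common length, then stack into (B, L)."""
        max_len = max(s.size(0) for s in seqs)
        return torch.stack(
            [F.pad(s, (0, max_len - s.size(0)), value=pad_value) for s in seqs]
        )

    input_ids = pad_and_stack([torch.tensor([5, 6, 7]), torch.tensor([8, 9])], pad_value=0)
    loss_mask = pad_and_stack([torch.ones(3), torch.ones(2)], pad_value=0.0)  # pads never supervised
    lengths = torch.tensor([3, 2])
    attention_mask = torch.arange(input_ids.size(1)) < lengths[:, None]
    position_ids = torch.arange(input_ids.size(1)).expand_as(input_ids)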

encode_batch(
batch: bridge.data.energon.hf_encoder_task_encoder.HFEncoderTaskBatch,
) → dict#

Convert batch dataclass to dict, wrapping visual tensors in GenericVisualInputs.
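
A sketch of that final hop to a plain dict; the use of dataclasses.asdict and the kwargs-style constructor are assumptions, and GenericVisualInputs itself is defined elsewhere in the bridge package:

    import dataclasses

    def encode_batch_sketch(batch) -> dict:
        out = dataclasses.asdict(batch)
        visual = out.pop("visual_tensors")
        # Assumed wrapper API; the real GenericVisualInputs signature may differ.
        out["visual_tensors"] = GenericVisualInputs(**visual)  # noqa: F821 (imported elsewhere)
        return out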