`bridge.data.sequence_batching`#

Collate-time sequence batch padding, truncation, and packing helpers.

Module Contents#

Functions#

`_ceil_to_multiple`
`_token_key`
`_set_tokens`
`_pad_or_truncate_2d`
`_pad_or_truncate_position_ids`
`_pad_or_truncate_attention_mask`
`pad_or_pack_sequence`	Pad, truncate, or pack sequence tensors for the training step.

API#

bridge.data.sequence_batching._ceil_to_multiple(value: int, multiple: int) → int#

bridge.data.sequence_batching._token_key( batch: collections.abc.MutableMapping[str, Any], ) → str#

bridge.data.sequence_batching._set_tokens( batch: collections.abc.MutableMapping[str, Any], token_key: str, value: torch.Tensor, ) → None#

bridge.data.sequence_batching._pad_or_truncate_2d( x: torch.Tensor | None, target_len: int, pad_value: int | float, ) → torch.Tensor | None#

bridge.data.sequence_batching._pad_or_truncate_position_ids( position_ids: torch.Tensor | None, target_len: int, ) → torch.Tensor | None#

bridge.data.sequence_batching._pad_or_truncate_attention_mask( attention_mask: torch.Tensor | None, target_len: int, ) → torch.Tensor | None#

bridge.data.sequence_batching.pad_or_pack_sequence( batch: collections.abc.MutableMapping[str, Any], *, sequence_length: int | None, pad_to_max_length: bool = False, pad_to_multiple_of: int = 128, enable_in_batch_packing: bool = False, in_batch_packing_pad_to_multiple_of: int = 1, pad_token_id: int = 0, ignore_index: int = IGNORE_INDEX, ) → None#

Pad, truncate, or pack sequence tensors for the training step.

This is the collate-time policy helper for sequence tensors. When packing is enabled, it still pads first and then packs through an internal helper, because the current model collate functions produce padded tensors first. Longer term, packing collate functions should build flattened packed tensors directly.
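The pad-then-pack idea described above can be illustrated with a minimal sketch. It is not the module's implementation: `pack_padded_rows`, its arguments, and the `cu_seqlens` name for the cumulative-length metadata are all hypothetical, and plain Python lists stand in for `torch.Tensor` rows so the example has no dependencies.

```python
from itertools import accumulate


def pack_padded_rows(rows, lengths, pad_value=0, multiple=1):
    """Flatten padded rows into one packed sequence (illustrative only).

    rows:     padded token rows, all of equal length
    lengths:  true (unpadded) length of each row
    multiple: each packed sequence is padded up to a multiple of this value
    """
    packed = []
    padded_lengths = []
    for row, length in zip(rows, lengths):
        # Round the true length up to the requested per-sequence multiple.
        padded_len = -(-length // multiple) * multiple
        # Drop the batch-level padding, then re-pad only up to padded_len.
        tokens = list(row[:length])
        packed.extend(tokens + [pad_value] * (padded_len - length))
        padded_lengths.append(padded_len)
    # Cumulative boundaries: sequence i occupies packed[cu[i]:cu[i + 1]].
    cu_seqlens = [0] + list(accumulate(padded_lengths))
    return packed, cu_seqlens


rows = [[1, 2, 3, 0], [4, 5, 0, 0]]
packed, cu_seqlens = pack_padded_rows(rows, lengths=[3, 2], multiple=2)
```

With `multiple=2`, the first sequence keeps one pad token to reach length 4, while the second is already a multiple of 2 and loses all of its batch-level padding.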

Parameters:

batch – Mutable collate batch with input_ids or tokens plus labels, loss_mask, position_ids, and optional attention_mask.
sequence_length – Model sequence cap. If unset, non-packed batches are left at the processor’s batch-max length.
pad_to_max_length – If true, pad/truncate non-packed batches directly to sequence_length. This preserves the former PP/EP fixed-shape path.
pad_to_multiple_of – Multiple to which non-packed batch lengths are padded for efficiency when pad_to_max_length is false.
enable_in_batch_packing – If true, flatten the microbatch and emit packed-sequence metadata instead of returning a padded attention mask.
in_batch_packing_pad_to_multiple_of – Multiple to which each packed sequence's length is padded, to satisfy CP/SP constraints.
pad_token_id – Token value for inserted padding.
ignore_index – Label value for inserted padding.
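The non-packed length policy implied by the parameters above might look roughly like the sketch below. This is a reading of the documented parameters, not the actual code: `ceil_to_multiple`, `target_length`, and `pad_or_truncate_row` are hypothetical stand-ins, capping the rounded length at `sequence_length` is an assumption, and Python lists are used in place of tensors.

```python
def ceil_to_multiple(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple (hypothetical helper)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


def target_length(batch_max_len, sequence_length,
                  pad_to_max_length=False, pad_to_multiple_of=128):
    """Choose the padded length for a non-packed batch (illustrative only)."""
    if sequence_length is None:
        # No cap configured: keep the batch-max length as-is.
        return batch_max_len
    if pad_to_max_length:
        # Fixed-shape path: always pad or truncate to the cap.
        return sequence_length
    # Assumption: round up to the multiple, but never exceed the cap.
    return min(ceil_to_multiple(batch_max_len, pad_to_multiple_of),
               sequence_length)


def pad_or_truncate_row(row, target_len, pad_value):
    """Truncate a row to target_len, or right-pad it with pad_value."""
    return list(row[:target_len]) + [pad_value] * max(0, target_len - len(row))


length = target_length(300, 4096)                 # rounds 300 up to 384
tokens = pad_or_truncate_row([7, 8, 9], 5, 0)     # pad_token_id = 0
labels = pad_or_truncate_row([7, 8, 9], 5, -100)  # assumed ignore_index value
```

In this sketch, inserted positions receive `pad_token_id` in the token row and the ignore value in the label row, so padded positions would not contribute to the loss. The `-100` here is only a placeholder; the real `IGNORE_INDEX` value is not given in this page.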
