bridge.data.sequence_batching#

Collate-time sequence batch padding, truncation, and packing helpers.

Module Contents#

Functions#

API#

bridge.data.sequence_batching._ceil_to_multiple(value: int, multiple: int) int#
bridge.data.sequence_batching._token_key(
batch: collections.abc.MutableMapping[str, Any],
) str#
bridge.data.sequence_batching._set_tokens(
batch: collections.abc.MutableMapping[str, Any],
token_key: str,
value: torch.Tensor,
) None#
bridge.data.sequence_batching._pad_or_truncate_2d(
x: torch.Tensor | None,
target_len: int,
pad_value: int | float,
) torch.Tensor | None#
bridge.data.sequence_batching._pad_or_truncate_position_ids(
position_ids: torch.Tensor | None,
target_len: int,
) torch.Tensor | None#
bridge.data.sequence_batching._pad_or_truncate_attention_mask(
attention_mask: torch.Tensor | None,
target_len: int,
) torch.Tensor | None#
bridge.data.sequence_batching.pad_or_pack_sequence(
batch: collections.abc.MutableMapping[str, Any],
*,
sequence_length: int | None,
pad_to_max_length: bool = False,
pad_to_multiple_of: int = 128,
enable_in_batch_packing: bool = False,
in_batch_packing_pad_to_multiple_of: int = 1,
pad_token_id: int = 0,
ignore_index: int = IGNORE_INDEX,
) None#

Pad, truncate, or pack sequence tensors for the training step.

This is the collate-time policy helper for sequence tensors. When packing is enabled it still uses an internal pad-then-pack helper, because the current model collates first produce padded tensors. Longer term, packing collates should build flattened packed tensors directly.

Parameters:
  • batch – Mutable collate batch with input_ids or tokens plus labels, loss_mask, position_ids, and optional attention_mask.

  • sequence_length – Model sequence cap. If unset, non-packed batches are left at the processor’s batch-max length.

  • pad_to_max_length – If true, pad/truncate non-packed batches directly to sequence_length. This preserves the former PP/EP fixed-shape path.

  • pad_to_multiple_of – Efficient non-packed length multiple used when pad_to_max_length is false.

  • enable_in_batch_packing – If true, flatten the microbatch and emit packed-sequence metadata instead of returning a padded attention mask.

  • in_batch_packing_pad_to_multiple_of – Per-sequence packed length multiple for CP/SP constraints.

  • pad_token_id – Token value for inserted padding.

  • ignore_index – Label value for inserted padding.