bridge.data.sequence_packing

Shared helpers for collate-time in-batch sequence packing.

Module Contents

Functions

_sequence_lengths

_ceil_to_multiple

pack_padded_sequences_in_batch

Flatten a padded microbatch and attach packed-sequence metadata.

API

bridge.data.sequence_packing._sequence_lengths(
tokens: torch.Tensor,
*,
pad_token_id: int,
padding_mask: torch.Tensor | None,
) -> list[int]
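
The signature above suggests a helper that returns one unpadded length per row. The following is a minimal sketch of that idea, not the library's implementation: it uses plain Python lists in place of tensors, assumes right-padded rows, and assumes a mask convention of 1 for real tokens and 0 for padding. The name sequence_lengths_sketch is hypothetical.

```python
from typing import Optional, Sequence


def sequence_lengths_sketch(
    tokens: Sequence[Sequence[int]],
    *,
    pad_token_id: int,
    padding_mask: Optional[Sequence[Sequence[int]]],
) -> list[int]:
    """Return one unpadded length per row of a right-padded batch (illustrative only)."""
    if padding_mask is not None:
        # Assumed convention: 1 marks a real token, 0 marks padding.
        return [sum(1 for m in row if m) for row in padding_mask]

    lengths = []
    for row in tokens:
        length = len(row)
        # Strip trailing pad tokens only, so a pad id inside the sequence is kept.
        while length > 0 and row[length - 1] == pad_token_id:
            length -= 1
        lengths.append(length)
    return lengths
```

When a mask is supplied the sketch trusts it over the token values, which matters if pad_token_id can also occur as a real token.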
bridge.data.sequence_packing._ceil_to_multiple(value: int, multiple: int) -> int
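
The name and signature of _ceil_to_multiple suggest rounding value up to the nearest multiple of multiple. A minimal sketch of that arithmetic follows, assuming non-negative value and positive multiple; the function name ceil_to_multiple_sketch and the error handling are illustrative and not taken from the module.

```python
def ceil_to_multiple_sketch(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple` (illustrative only)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # -(-a // b) is integer ceiling division without floating point.
    return -(-value // multiple) * multiple
```

With multiple set to 1 the value is returned unchanged, which is consistent with pad_to_multiple_of defaulting to 1.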
bridge.data.sequence_packing.pack_padded_sequences_in_batch(
batch: collections.abc.MutableMapping[str, Any],
*,
pad_token_id: int = 0,
ignore_index: int = -100,
pad_to_multiple_of: int = 1,
tokens_key: str = 'tokens',
input_ids_key: str = 'input_ids',
labels_key: str = 'labels',
loss_mask_key: str = 'loss_mask',
position_ids_key: str = 'position_ids',
attention_mask_key: str = 'attention_mask',
) -> None

Flatten a padded microbatch and attach packed-sequence metadata.

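The helper mutates batch in place. It converts text-like tensors from [B, S] to [1, sum(L_i)], where B is the batch size, S is the padded sequence length, and L_i is the packed length of sample i. It also emits metadata consumed by megatron.bridge.training.gpt_step.get_packed_seq_params.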
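
To make the shape change concrete, here is a small sketch of the flattening step using plain Python lists instead of tensors. It is not the module's code. The function name flatten_padded_rows is hypothetical, and the cumulative-offsets list (cu_seqlens) is an assumption about what packed-sequence metadata typically looks like; the source does not name the metadata fields.

```python
from itertools import accumulate


def flatten_padded_rows(rows, lengths):
    """Concatenate the first lengths[i] items of each row into a single row.

    Returns the flattened batch with shape [1, sum(L_i)] and the cumulative
    sequence boundaries (an assumed metadata layout).
    """
    flat = [tok for row, n in zip(rows, lengths) for tok in row[:n]]
    # Boundaries of each sample inside the flat row: [0, L_0, L_0 + L_1, ...].
    cu_seqlens = [0, *accumulate(lengths)]
    return [flat], cu_seqlens


rows = [[11, 12, 13, 0], [21, 22, 0, 0]]  # B = 2, S = 4
packed, cu_seqlens = flatten_padded_rows(rows, lengths=[3, 2])
print(packed)      # [[11, 12, 13, 21, 22]]
print(cu_seqlens)  # [0, 3, 5]
```

Sample i occupies positions cu_seqlens[i] up to, but not including, cu_seqlens[i + 1] of the flat row.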

Parameters:
  • batch – Batch dictionary containing at least tokens/input_ids and position_ids.

  • pad_token_id – Token value to write for padding inserted by pad_to_multiple_of.

  • ignore_index – Label value to write for inserted padding.

  • pad_to_multiple_of – Multiple to which each sample's packed length is padded. The default of 1 inserts no padding.

  • tokens_key – Preferred token key.

  • input_ids_key – Optional alias key for tokens.

  • labels_key – Key containing labels to pack when present.

  • loss_mask_key – Key containing loss mask to pack when present.

  • position_ids_key – Key containing position ids to pack.

  • attention_mask_key – Key containing the 1/0 padding mask. The corresponding batch entry is set to None after packing.
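
The way the parameters above interact can be sketched on plain Python lists. This is an illustration under stated assumptions, not the library function: pack_lists_in_batch_sketch and the cu_seqlens key are hypothetical, the mask is assumed to use 1 for real tokens, and position ids and loss masks are left out for brevity.

```python
from typing import Any, MutableMapping


def pack_lists_in_batch_sketch(
    batch: MutableMapping[str, Any],
    *,
    pad_token_id: int = 0,
    ignore_index: int = -100,
    pad_to_multiple_of: int = 1,
) -> None:
    """Pack list-based 'tokens' and 'labels' in place (illustrative only)."""
    lengths = [sum(row) for row in batch["attention_mask"]]
    tokens, labels, cu_seqlens = [], [], [0]
    for tok_row, lab_row, n in zip(batch["tokens"], batch["labels"], lengths):
        # Round each sample's length up to the requested multiple.
        padded = -(-n // pad_to_multiple_of) * pad_to_multiple_of
        extra = padded - n
        tokens += list(tok_row[:n]) + [pad_token_id] * extra
        labels += list(lab_row[:n]) + [ignore_index] * extra
        cu_seqlens.append(cu_seqlens[-1] + padded)
    batch["tokens"] = [tokens]
    batch["labels"] = [labels]
    batch["attention_mask"] = None     # mask entry is cleared after packing
    batch["cu_seqlens"] = cu_seqlens   # hypothetical metadata key


batch = {
    "tokens": [[11, 12, 13, 0], [21, 22, 0, 0]],
    "labels": [[12, 13, 14, -100], [22, 23, -100, -100]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]],
}
pack_lists_in_batch_sketch(batch, pad_to_multiple_of=2)
print(batch["tokens"])      # [[11, 12, 13, 0, 21, 22]]
print(batch["labels"])      # [[12, 13, 14, -100, 22, 23]]
print(batch["cu_seqlens"])  # [0, 4, 6]
```

Here the first sample has length 3 and is padded to 4 with pad_token_id in the tokens and ignore_index in the labels, while the second sample already has an even length and receives no padding.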