bridge.data.sequence_packing#
Shared helpers for collate-time in-batch sequence packing.
Module Contents#
Functions#
Flatten a padded microbatch and attach packed-sequence metadata. |
API#
- bridge.data.sequence_packing._sequence_lengths(
- tokens: torch.Tensor,
- *,
- pad_token_id: int,
- padding_mask: torch.Tensor | None,
- bridge.data.sequence_packing._ceil_to_multiple(value: int, multiple: int) int#
- bridge.data.sequence_packing.pack_padded_sequences_in_batch(
- batch: collections.abc.MutableMapping[str, Any],
- *,
- pad_token_id: int = 0,
- ignore_index: int = -100,
- pad_to_multiple_of: int = 1,
- tokens_key: str = 'tokens',
- input_ids_key: str = 'input_ids',
- labels_key: str = 'labels',
- loss_mask_key: str = 'loss_mask',
- position_ids_key: str = 'position_ids',
- attention_mask_key: str = 'attention_mask',
Flatten a padded microbatch and attach packed-sequence metadata.
The helper mutates
batchin place. It converts text-like tensors from[B, S]to[1, sum(L_i)]and emits metadata consumed bymegatron.bridge.training.gpt_step.get_packed_seq_params.- Parameters:
batch – Batch dictionary containing at least tokens/input_ids and position_ids.
pad_token_id – Token value to write for padding inserted by
pad_to_multiple_of.ignore_index – Label value to write for inserted padding.
pad_to_multiple_of – Optional per-sample packed length multiple.
tokens_key – Preferred token key.
input_ids_key – Optional alias key for tokens.
labels_key – Key containing labels to pack when present.
loss_mask_key – Key containing loss mask to pack when present.
position_ids_key – Key containing position ids to pack.
attention_mask_key – Key containing 1/0 padding mask. Set to
Noneafter packing.