bridge.data.datasets.packed_sequence#

Module Contents#

Classes#

PackedSequenceSpecs

Configuration class for packed sequence datasets.

Functions#

tokenize_dataset

Tokenizes a dataset from the provided path using the specified tokenizer and prepares it for further processing.

prepare_packed_sequence_data

Prepares a packed sequence dataset from a given input file and saves it to an output file.

Data#

API#

bridge.data.datasets.packed_sequence.logger#

getLogger(...)

bridge.data.datasets.packed_sequence.tokenize_dataset(
path: pathlib.Path,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
max_seq_length: int,
seed: int,
)#

Tokenizes a dataset from the provided path using the specified tokenizer and prepares it for further processing.

Parameters:
  • path (Path) – Path to the dataset file.

  • tokenizer (MegatronTokenizer) – The tokenizer to use for tokenization.

  • max_seq_length (int) – Maximum sequence length for the tokens.

  • seed (int) – Random seed for shuffling the dataset.

Returns:

A NumPy array containing the tokenized data.

Return type:

np.ndarray
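
A minimal usage sketch for tokenize_dataset. The dataset path and sequence length are illustrative, and build_megatron_tokenizer is a hypothetical stand-in for however your project constructs a MegatronTokenizer; it is not part of this API.

```python
from pathlib import Path

from bridge.data.datasets.packed_sequence import tokenize_dataset

# Hypothetical helper; substitute your project's MegatronTokenizer setup.
tokenizer = build_megatron_tokenizer()

tokens = tokenize_dataset(
    path=Path("data/train.jsonl"),  # illustrative dataset file
    tokenizer=tokenizer,
    max_seq_length=2048,            # illustrative value
    seed=0,
)
# `tokens` is a NumPy array containing the tokenized data.
```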

bridge.data.datasets.packed_sequence.prepare_packed_sequence_data(
input_path: pathlib.Path,
output_path: pathlib.Path,
output_metadata_path: pathlib.Path,
packed_sequence_size: int,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
max_seq_length: int,
seed: Optional[int] = 0,
packing_algorithm: str = 'first_fit_shuffle',
)#

Prepares a packed sequence dataset from a given input file and saves it to an output file.

Parameters:
  • input_path (Path) – Path to the input dataset file.

  • output_path (Path) – Path to save the packed sequence data.

  • output_metadata_path (Path) – Path to save the packing metadata.

  • packed_sequence_size (int) – The maximum size for each packed sequence.

  • tokenizer (MegatronTokenizer) – The tokenizer to use for tokenization.

  • max_seq_length (int) – Maximum sequence length for the tokens.

  • seed (Optional[int]) – Random seed for shuffling. Defaults to 0.

  • packing_algorithm (str) – The algorithm used for packing sequences. Currently supports “first_fit_shuffle” and “first_fit_decreasing”. Defaults to “first_fit_shuffle”.

Returns:

Nothing is returned; the packed sequence data and its metadata are saved to the specified output paths.

Return type:

None
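
A hedged end-to-end sketch of tokenizing and packing a dataset into a fixed pack size. File paths and size values are placeholders, and build_megatron_tokenizer is again a hypothetical stand-in for your tokenizer setup.

```python
from pathlib import Path

from bridge.data.datasets.packed_sequence import prepare_packed_sequence_data

# Hypothetical helper, as in the previous sketch.
tokenizer = build_megatron_tokenizer()

prepare_packed_sequence_data(
    input_path=Path("data/train.jsonl"),           # illustrative input file
    output_path=Path("data/train_packed.npy"),     # packed data destination
    output_metadata_path=Path("data/train_packed_metadata.jsonl"),
    packed_sequence_size=4096,  # each packed sequence holds up to 4096 tokens
    tokenizer=tokenizer,
    max_seq_length=2048,        # individual sequences are truncated to this length
    seed=0,
    packing_algorithm="first_fit_shuffle",  # or "first_fit_decreasing"
)
# Returns None; the packed data and metadata are written to the paths above.
```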

class bridge.data.datasets.packed_sequence.PackedSequenceSpecs#

Configuration class for packed sequence datasets.

This class holds parameters related to sequence packing, including the size of the packed sequences, tokenizer information, paths to packed data files, and other related settings.

packed_sequence_size: int#

None

If a positive integer, this arg enables training with sequence packing and specifies the pack size. If less than or equal to 0, sequence packing is disabled. Defaults to -1. Note: This arg is distinct from seq_length because seq_length specifies the maximum length of the original sequence (i.e., the length at which long sequences in the input data are truncated).

tokenizer_model_name: str#

None

Keeps track of the tokenizer model name, since each tokenizer produces a different packed sequence dataset file. This field is set by the llm.finetune API.

packed_train_data_path: str#

None

If specified, use this file for the packed training dataset instead of the default path.

packed_val_data_path: str#

None

If specified, use this file for the packed validation dataset instead of the default path.

packed_metadata_path: str#

None

If specified, use this file for the training and validation packing metadata file instead of the default path.

pad_cu_seqlens: bool#

False

If True, pad cu_seqlens to a constant size, which is required for use with CUDA graphs.

__post_init__()#
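
A minimal configuration sketch, assuming PackedSequenceSpecs is a dataclass whose fields can be passed as keyword arguments; the field values are illustrative.

```python
from bridge.data.datasets.packed_sequence import PackedSequenceSpecs

specs = PackedSequenceSpecs(
    packed_sequence_size=4096,            # > 0 enables sequence packing
    tokenizer_model_name="my-tokenizer",  # hypothetical; normally set by the llm.finetune API
    pad_cu_seqlens=False,                 # set True when running with CUDA graphs
)
```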