bridge.data.datasets.packed_sequence#
Module Contents#
Classes#
| PackedSequenceSpecs | Configuration class for packed sequence datasets. |
Functions#
| tokenize_dataset | Tokenizes a dataset from the provided path using the specified tokenizer and prepares it for further processing. |
| prepare_packed_sequence_data | Prepares a packed sequence dataset from a given input file and saves it to an output file. |
Data#
API#
- bridge.data.datasets.packed_sequence.logger#
'getLogger(...)'
- bridge.data.datasets.packed_sequence.tokenize_dataset(
- path: pathlib.Path,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- max_seq_length: int,
- seed: int,
- ) -> np.ndarray#
Tokenizes a dataset from the provided path using the specified tokenizer and prepares it for further processing.
- Parameters:
path (Path) – Path to the dataset file.
tokenizer (MegatronTokenizer) – The tokenizer to use for tokenization.
max_seq_length (int) – Maximum sequence length for the tokens.
seed (int) – Random seed for shuffling the dataset (optional).
- Returns:
A NumPy array containing the tokenized data.
- Return type:
np.ndarray
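A minimal usage sketch follows. The import path mirrors this page's module name (your installation may root it under `megatron.bridge`), the file name and argument values are illustrative, and `load_my_tokenizer` is a hypothetical stand-in for however your setup constructs a `MegatronTokenizer`:

```python
from pathlib import Path

from bridge.data.datasets.packed_sequence import tokenize_dataset

# Hypothetical helper -- constructing a MegatronTokenizer is outside this module.
tokenizer = load_my_tokenizer()

tokens = tokenize_dataset(
    path=Path("train.jsonl"),  # illustrative dataset file
    tokenizer=tokenizer,
    max_seq_length=2048,       # truncate longer sequences to this length
    seed=42,                   # shuffling seed
)
# tokens is an np.ndarray of tokenized examples, ready for packing.
```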
- bridge.data.datasets.packed_sequence.prepare_packed_sequence_data(
- input_path: pathlib.Path,
- output_path: pathlib.Path,
- output_metadata_path: pathlib.Path,
- packed_sequence_size: int,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- max_seq_length: int,
- seed: Optional[int] = 0,
- packing_algorithm: str = 'first_fit_shuffle',
- ) -> None#
Prepares a packed sequence dataset from a given input file and saves it to an output file.
- Parameters:
input_path (Path) – Path to the input dataset file.
output_path (Path) – Path to save the packed sequence data.
output_metadata_path (Path) – Path to save the packing metadata.
packed_sequence_size (int) – The maximum size for each packed sequence.
tokenizer (MegatronTokenizer) – The tokenizer to use for tokenization.
max_seq_length (int) – Maximum sequence length for the tokens.
seed (Optional[int]) – Random seed for shuffling (optional).
packing_algorithm (str) – The algorithm used for packing sequences; currently supports 'first_fit_shuffle' and 'first_fit_decreasing' (see the illustrative sketch below).
- Returns:
None; the packed sequence data and metadata are saved to the specified output paths.
- Return type:
None
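A hedged end-to-end sketch under the same assumptions as the example above (import path, tokenizer, file names, and the `.npy`/`.jsonl` suffixes are illustrative guesses, not documented requirements):

```python
from pathlib import Path

from bridge.data.datasets.packed_sequence import prepare_packed_sequence_data

prepare_packed_sequence_data(
    input_path=Path("train.jsonl"),
    output_path=Path("packed_train.npy"),                # packed sequence data
    output_metadata_path=Path("packed_metadata.jsonl"),  # packing metadata
    packed_sequence_size=4096,  # each pack holds up to 4096 tokens
    tokenizer=tokenizer,        # a MegatronTokenizer, as in the example above
    max_seq_length=2048,
    seed=0,
    packing_algorithm="first_fit_shuffle",
)
```

To clarify the two `packing_algorithm` names, here is an illustrative first-fit bin-packing loop. This is the textbook idea the option names refer to, not this module's internal implementation:

```python
import random

def first_fit(seq_lengths: list[int], pack_size: int) -> list[list[int]]:
    """Greedily place each sequence into the first bin with enough room left."""
    bins: list[list[int]] = []
    for length in seq_lengths:
        for b in bins:
            if sum(b) + length <= pack_size:
                b.append(length)
                break
        else:
            bins.append([length])  # nothing fits; open a new bin
    return bins

lengths = [5, 9, 3, 7, 2]
# 'first_fit_decreasing': pack longest-first for tighter, deterministic bins.
print(first_fit(sorted(lengths, reverse=True), pack_size=10))  # [[9], [7, 3], [5, 2]]
# 'first_fit_shuffle': shuffle first, trading some tightness for randomized packs.
random.shuffle(lengths)
print(first_fit(lengths, pack_size=10))
```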
- class bridge.data.datasets.packed_sequence.PackedSequenceSpecs#
Configuration class for packed sequence datasets.
This class holds parameters related to sequence packing, including the size of the packed sequences, tokenizer information, paths to packed data files, and other related settings.
- packed_sequence_size: int#
None
If a positive integer, this arg enables training with sequence packing and specifies the pack size. If less than or equal to 0, sequence packing is disabled. Defaults to -1. Note: this arg is distinct from seq_length because seq_length specifies the maximum length of the original sequence (i.e. the length to truncate long sequences in the input data).
- tokenizer_model_name: str#
None
Keeps track of the tokenizer model name, since each tokenizer produces a different packed sequence dataset file. This field is set by the llm.finetune API.
- packed_train_data_path: str#
None
If specified, use this file for the packed training dataset instead of the default path.
- packed_val_data_path: str#
None
If specified, use this file for the packed validation dataset instead of the default path.
- packed_metadata_path: str#
None
If specified, use this file for the training and validation packing metadata file instead of the default path.
- pad_cu_seqlens: bool#
False
If True, pad cu_seqlens to a constant size, which is required for use with cudagraphs.
- __post_init__()#
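The class defines `__post_init__`, which suggests a dataclass; assuming its fields are accepted as keyword arguments, a configuration sketch might look like the following (all values and file names are illustrative assumptions):

```python
from bridge.data.datasets.packed_sequence import PackedSequenceSpecs

specs = PackedSequenceSpecs(
    packed_sequence_size=4096,            # > 0 enables sequence packing
    tokenizer_model_name="my-tokenizer",  # normally set by the llm.finetune API
    packed_train_data_path="packed_train.npy",     # assumed file name
    packed_val_data_path="packed_val.npy",         # assumed file name
    packed_metadata_path="packed_metadata.jsonl",  # assumed file name
    pad_cu_seqlens=False,  # set True when training with CUDA graphs
)
```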