bridge.data.datasets.packed_sequence#

Module Contents#

Classes#

PackedSequenceSpecs

Configuration class for packed sequence datasets.

Functions#

_tokenize_get_item

_tokenize_init_worker

_retrieve_tokenized

tokenize_dataset

Tokenizes a dataset from the provided path using the specified tokenizer and prepares it for further processing.

prepare_packed_sequence_data

Prepares a packed sequence dataset from a given input file and saves it to an output file.

Data#

logger

_shared_dataset

API#

bridge.data.datasets.packed_sequence.logger#

‘getLogger(…)’

bridge.data.datasets.packed_sequence._shared_dataset#

None

bridge.data.datasets.packed_sequence._tokenize_get_item(i)#
bridge.data.datasets.packed_sequence._tokenize_init_worker(dataset)#
bridge.data.datasets.packed_sequence._retrieve_tokenized(dataset, num_workers)#
bridge.data.datasets.packed_sequence.tokenize_dataset(
path: pathlib.Path,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
max_seq_length: int,
seed: int,
dataset_kwargs: dict | None = None,
pad_seq_to_mult: int | None = 1,
num_tokenizer_workers: int = -1,
)#

Tokenizes a dataset from the provided path using the specified tokenizer and prepares it for further processing.

Parameters:
  • path (Path) – Path to the dataset file.

  • tokenizer (MegatronTokenizer) – The tokenizer to use for tokenization.

  • max_seq_length (int) – Maximum sequence length for the tokens.

  • seed (int) – Random seed for shuffling the dataset.

  • dataset_kwargs (dict | None) – Additional keyword arguments to pass to create_sft_dataset. Can include ‘chat’, ‘use_hf_tokenizer_chat_template’, ‘tool_schemas’, etc.

  • pad_seq_to_mult (int | None) – Optional multiple to pad each sequence to during packing preparation (e.g., set to 2 * context_parallel_size for THD CP).

  • num_tokenizer_workers (int) – Number of worker processes to use for tokenization. If -1, the number of workers is set to the number of available CPU cores.

Returns:

A NumPy array containing the tokenized data.

Return type:

np.ndarray
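
The effect of pad_seq_to_mult can be illustrated with a small stand-alone sketch (the helper name and pad token id below are hypothetical, not part of this module's API): each sample is padded so its token count is divisible by the given multiple, which keeps packed samples splittable across context-parallel ranks.

```python
def pad_to_multiple(tokens: list[int], multiple: int, pad_id: int = 0) -> list[int]:
    """Pad a token list so its length is a multiple of `multiple`."""
    if multiple <= 1:
        return tokens
    remainder = len(tokens) % multiple
    if remainder:
        tokens = tokens + [pad_id] * (multiple - remainder)
    return tokens

# e.g. for THD context parallelism with context_parallel_size=4,
# pad each sample to a multiple of 2 * 4 = 8 tokens:
sample = list(range(13))
padded = pad_to_multiple(sample, multiple=2 * 4)
print(len(padded))  # 16
```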

bridge.data.datasets.packed_sequence.prepare_packed_sequence_data(
input_path: pathlib.Path,
output_path: pathlib.Path,
output_metadata_path: pathlib.Path,
packed_sequence_size: int,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
max_seq_length: int,
seed: int | None = 0,
packing_algorithm: str = 'first_fit_shuffle',
dataset_kwargs: dict | None = None,
pad_seq_to_mult: int | None = 1,
num_tokenizer_workers: int = -1,
)#

Prepares a packed sequence dataset from a given input file and saves it to an output file.

Parameters:
  • input_path (Path) – Path to the input dataset file.

  • output_path (Path) – Path to save the packed sequence data.

  • output_metadata_path (Path) – Path to save packing metadata.

  • packed_sequence_size (int) – The maximum size for each packed sequence.

  • tokenizer (MegatronTokenizer) – The tokenizer to use for tokenization.

  • max_seq_length (int) – Maximum sequence length for the tokens.

  • seed (int | None) – Random seed for shuffling (optional).

  • packing_algorithm (str) – The algorithm used for packing sequences; currently supports “first_fit_shuffle” and “first_fit_decreasing”.

  • dataset_kwargs (dict | None) – Additional keyword arguments to pass to create_sft_dataset. Enables packing with chat templates, tool schemas, etc.

  • pad_seq_to_mult (int | None) – Optional multiple to pad each sequence to during packing preparation (e.g., set to 2 * context_parallel_size for THD CP).

  • num_tokenizer_workers (int) – Number of worker processes to use for tokenization. If -1, the number of workers is set to the number of available CPU cores.

Returns:

Saves the packed sequence data to the specified output path.

Return type:

None
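
Both supported packing algorithms are variants of classic first-fit bin packing. A minimal sketch, assuming a simple greedy implementation (the function names mirror the documented algorithm strings, but this is illustrative code, not the module's internals): sequences are placed into the first pack with enough remaining room, after either sorting by decreasing length or shuffling with the given seed.

```python
import random

def first_fit(seq_lens: list[int], pack_size: int) -> list[list[int]]:
    """Assign each sequence length to the first pack with enough room."""
    packs: list[list[int]] = []
    for length in seq_lens:
        for pack in packs:
            if sum(pack) + length <= pack_size:
                pack.append(length)
                break
        else:
            packs.append([length])  # no pack fits: open a new one
    return packs

def first_fit_decreasing(seq_lens: list[int], pack_size: int) -> list[list[int]]:
    return first_fit(sorted(seq_lens, reverse=True), pack_size)

def first_fit_shuffle(seq_lens: list[int], pack_size: int, seed: int = 0) -> list[list[int]]:
    rng = random.Random(seed)
    shuffled = list(seq_lens)
    rng.shuffle(shuffled)
    return first_fit(shuffled, pack_size)

lengths = [512, 1024, 256, 2048, 768, 512]
packs = first_fit_decreasing(lengths, pack_size=2048)
assert all(sum(pack) <= 2048 for pack in packs)
```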

class bridge.data.datasets.packed_sequence.PackedSequenceSpecs#

Configuration class for packed sequence datasets.

This class holds parameters related to sequence packing, including the size of the packed sequences, tokenizer information, paths to packed data files, and other related settings.

packed_sequence_size: int#

None

If a positive integer, this arg enables training with sequence packing and specifies the pack size. If less than or equal to 0, sequence packing is disabled. Defaults to -1. Note: this arg is distinct from seq_length because seq_length specifies the maximum length of the original sequence (i.e., the length to which long sequences in the input data are truncated).

tokenizer_model_name: str#

None

Keeps track of the tokenizer model name, since each tokenizer produces a different packed sequence dataset file. This field is set by the llm.finetune API.

num_tokenizer_workers: int#

None

The number of worker processes to use for tokenization when preparing the packed sequence dataset. If -1, the number of workers is set to the number of available CPU cores.

packed_train_data_path: str#

None

If specified, use this file for the packed training dataset instead of the default path.

packed_val_data_path: str#

None

If specified, use this file for the packed validation dataset instead of the default path.

packed_metadata_path: str#

None

If specified, use this file for the training and validation packing metadata file instead of the default path.

pad_cu_seqlens: bool#

False

If True, pad cu_seqlens to a constant size, which is required for use with cudagraphs.

pad_seq_to_mult: int | None#

1

Optional multiple to pad each sample to when generating packed datasets. For THD/context parallelism, set to (context_parallel_size * 2) to keep samples divisible.
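
Putting the fields together, a hypothetical dataclass stand-in that mirrors the documented attributes (the real class lives in bridge.data.datasets.packed_sequence; defaults marked "assumed" are not stated above) shows a typical configuration for packed-sequence training:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative mirror of PackedSequenceSpecs, not the real class.
@dataclass
class PackedSequenceSpecsSketch:
    packed_sequence_size: int = -1                # <= 0 disables packing
    tokenizer_model_name: Optional[str] = None    # assumed default
    num_tokenizer_workers: int = -1               # -1 -> all CPU cores
    packed_train_data_path: Optional[str] = None  # assumed default
    packed_val_data_path: Optional[str] = None    # assumed default
    packed_metadata_path: Optional[str] = None    # assumed default
    pad_cu_seqlens: bool = False                  # True required for CUDA graphs
    pad_seq_to_mult: Optional[int] = 1

# Enable packing at 4096 tokens, padded for context_parallel_size=2:
specs = PackedSequenceSpecsSketch(packed_sequence_size=4096, pad_seq_to_mult=2 * 2)
assert specs.packed_sequence_size > 0  # packing enabled
```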

__post_init__()#
_validate_packed_path(attr_name: str, path_value: str) → None#

Validate a packed data path and store it appropriately.

  • .npy files: strict validation with Path.exists()

  • Packed parquet specs: validated via resolution (supports directories and globs)

Parameters:
  • attr_name – The attribute name being validated (for error messages)

  • path_value – The path value to validate

Raises:
  • FileNotFoundError – If the path does not exist or resolves to no files

  • ValueError – If the path format is invalid