bridge.data.datasets.fim_dataset#

Module Contents#

Classes#

GPTFIMDataset

FIM (Fill In The Middle) GPT Dataset

Data#

API#

bridge.data.datasets.fim_dataset.logger#

'getLogger(...)'

class bridge.data.datasets.fim_dataset.GPTFIMDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: int,
index_split: megatron.core.datasets.utils.Split,
config: megatron.bridge.training.config.GPTFIMDatasetConfig,
)#

Bases: megatron.core.datasets.gpt_dataset.GPTDataset

FIM (Fill In The Middle) GPT Dataset

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset

  • indexed_indices (np.ndarray) – The set of document indices to expose

  • num_samples (int) – The number of samples to draw from the indexed dataset

  • index_split (Split) – The indexed_indices Split

  • config (GPTFIMDatasetConfig) – The GPT-specific container for all config sourced parameters

Initialization
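The FIM idea this class implements can be sketched as follows. The sentinel token ids and the `fim_reorder` helper below are illustrative placeholders, not part of this API: a sequence is cut into (prefix, middle, suffix) and re-ordered with sentinel tokens so the model learns to infill the middle from the surrounding context.

```python
import numpy as np

# Illustrative sentinel token ids; the real ids come from the tokenizer/config.
PRE, MID, SUF = 50001, 50002, 50003

def fim_reorder(tokens: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Re-order tokens into PSM format: <pre> prefix <suf> suffix <mid> middle."""
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    return np.concatenate([[PRE], prefix, [SUF], suffix, [MID], middle])

seq = np.arange(10)           # stand-in token ids 0..9
out = fim_reorder(seq, 3, 7)  # treats tokens 3..6 as the middle span
```

At training time the middle span lands at the end of the sequence, so a standard left-to-right language-modeling loss teaches infilling.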

_query_document_sample_shuffle_indices(
idx: int,
) -> Tuple[numpy.ndarray, numpy.ndarray]#

Get the text (token ids) and document ids for a given index

Parameters:

idx (int) – The index into the dataset

Returns:

The text ids and document ids

Return type:

Tuple[np.ndarray, np.ndarray]

_fim_permute_sequence(sequence, rate)#
_fim_split_and_permute_sequence(sequence)#

If self.fim_split_sample is not None, split the sequence into fragments and apply FIM to each fragment; otherwise, apply FIM to the whole sequence.
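A minimal sketch of that split-then-permute behavior, assuming a hypothetical split token id and a caller-supplied permute function (neither name is part of this API):

```python
import numpy as np

SPLIT_TOK = 99  # illustrative stand-in for the fim_split_sample token id

def split_and_permute(seq: np.ndarray, permute, split_tok=SPLIT_TOK) -> np.ndarray:
    """Split seq at split_tok, apply permute to each fragment, and rejoin.
    If split_tok is None, permute the whole sequence instead."""
    if split_tok is None:
        return permute(seq)
    bounds = np.where(seq == split_tok)[0]
    frags, start = [], 0
    for b in bounds:
        frags.append(permute(seq[start:b]))
        frags.append(seq[b:b + 1])  # keep the split token itself in place
        start = b + 1
    frags.append(permute(seq[start:]))
    return np.concatenate(frags)
```

The split token stays untouched, so document/fragment boundaries survive the FIM transformation.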

_permute(
sample,
fim_rate,
fim_spm_rate,
tokenizer,
truncate_or_pad=True,
suffix_tok_id=None,
prefix_tok_id=None,
middle_tok_id=None,
pad_tok_id=None,
no_fim_prefix=None,
)#

Take in a sample (an np array with size (0, chunklength)) and perform a FIM transformation on it, maintaining the same sample length (if the transform creates a few extra tokens, they are dropped).
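A hedged sketch of the transformation described above, with placeholder sentinel ids and the truncate/pad bookkeeping omitted for brevity; the PSM/SPM orderings follow the usual FIM convention, but the exact layout here is an assumption, not this module's verified output:

```python
import numpy as np

def permute_sketch(sample, fim_rate, fim_spm_rate, rng, pre=1, mid=2, suf=3):
    """With probability fim_rate, pick two random cut points and re-order the
    sample into SPM or PSM format; pre/mid/suf are placeholder sentinel ids."""
    if rng.random() > fim_rate:
        return sample  # leave the sequence unchanged
    lo, hi = sorted(rng.integers(0, len(sample) + 1, size=2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.random() < fim_spm_rate:
        # SPM ordering: <pre><suf> suffix <mid> prefix middle
        return np.concatenate([[pre, suf], suffix, [mid], prefix, middle])
    # PSM ordering: <pre> prefix <suf> suffix <mid> middle
    return np.concatenate([[pre], prefix, [suf], suffix, [mid], middle])
```

The real method additionally truncates or pads (per truncate_or_pad) so the transformed sample keeps the original length despite the three added sentinel tokens.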