bridge.data.datasets.fim_dataset#

Module Contents#

Classes#

GPTFIMDataset

FIM (Fill In The Middle) GPT Dataset

Data#

API#

bridge.data.datasets.fim_dataset.logger#

'getLogger(...)'

class bridge.data.datasets.fim_dataset.GPTFIMDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: int,
index_split: megatron.core.datasets.utils.Split,
config: megatron.bridge.training.config.GPTFIMDatasetConfig,
)#

Bases: megatron.core.datasets.gpt_dataset.GPTDataset

FIM (Fill In The Middle) GPT Dataset

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset

  • indexed_indices (np.ndarray) – The set of document indices to expose

  • num_samples (int) – The number of samples to draw from the indexed dataset

  • index_split (Split) – The indexed_indices Split

  • config (GPTFIMDatasetConfig) – The GPT-specific container for all config sourced parameters

Initialization
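The FIM idea this class implements can be sketched as follows. The sentinel token ids and the `fim_reorder` helper below are illustrative placeholders, not part of this API: a sequence is cut into (prefix, middle, suffix) and re-ordered with sentinel tokens so the model learns to infill the middle from the surrounding context.

```python
import numpy as np

# Illustrative sentinel token ids; the real ids come from the tokenizer/config.
PRE, MID, SUF = 50001, 50002, 50003

def fim_reorder(tokens: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Re-order tokens into PSM format: <pre> prefix <suf> suffix <mid> middle."""
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    return np.concatenate([[PRE], prefix, [SUF], suffix, [MID], middle])

seq = np.arange(10)           # stand-in token ids 0..9
out = fim_reorder(seq, 3, 7)  # treats tokens 3..6 as the middle span
```

At training time the middle span lands at the end of the sequence, so a standard left-to-right language-modeling loss teaches infilling.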

_query_document_sample_shuffle_indices(
idx: int,
) -> Tuple[numpy.ndarray, numpy.ndarray]#

Get the text (token ids) and document ids for a given index

Parameters:

idx (int) – The index into the dataset

Returns:

The text ids and document ids

Return type:

Tuple[np.ndarray, np.ndarray]

_fim_permute_sequence(sequence, rate)#
_fim_split_and_permute_sequence(sequence)#

If self.fim_split_sample is not None, split the sequence into fragments and apply FIM to each fragment; otherwise, apply FIM to the whole sequence.
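A minimal sketch of that split-then-permute behavior, assuming a hypothetical split token id and a caller-supplied permute function (neither name is part of this API):

```python
import numpy as np

SPLIT_TOK = 99  # illustrative stand-in for the fim_split_sample token id

def split_and_permute(seq: np.ndarray, permute, split_tok=SPLIT_TOK) -> np.ndarray:
    """Split seq at split_tok, apply permute to each fragment, and rejoin.
    If split_tok is None, permute the whole sequence instead."""
    if split_tok is None:
        return permute(seq)
    bounds = np.where(seq == split_tok)[0]
    frags, start = [], 0
    for b in bounds:
        frags.append(permute(seq[start:b]))
        frags.append(seq[b:b + 1])  # keep the split token itself in place
        start = b + 1
    frags.append(permute(seq[start:]))
    return np.concatenate(frags)
```

The split token stays untouched, so document/fragment boundaries survive the FIM transformation.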

_permute(
sample,
fim_rate,
fim_spm_rate,
tokenizer,
truncate_or_pad=True,
suffix_tok_id=None,
prefix_tok_id=None,
middle_tok_id=None,
pad_tok_id=None,
no_fim_prefix=None,
)#

Take in a sample (an np array with size (0, chunklength)) and perform a FIM transformation on it, maintaining the same sample length (if the transform creates a few extra tokens, they are dropped).
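A hedged sketch of the transformation described above, with placeholder sentinel ids and the truncate/pad bookkeeping omitted for brevity; the PSM/SPM orderings follow the usual FIM convention, but the exact layout here is an assumption, not this module's verified output:

```python
import numpy as np

def permute_sketch(sample, fim_rate, fim_spm_rate, rng, pre=1, mid=2, suf=3):
    """With probability fim_rate, pick two random cut points and re-order the
    sample into SPM or PSM format; pre/mid/suf are placeholder sentinel ids."""
    if rng.random() > fim_rate:
        return sample  # leave the sequence unchanged
    lo, hi = sorted(rng.integers(0, len(sample) + 1, size=2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.random() < fim_spm_rate:
        # SPM ordering: <pre><suf> suffix <mid> prefix middle
        return np.concatenate([[pre, suf], suffix, [mid], prefix, middle])
    # PSM ordering: <pre> prefix <suf> suffix <mid> middle
    return np.concatenate([[pre], prefix, [suf], suffix, [mid], middle])
```

The real method additionally truncates or pads (per truncate_or_pad) so the transformed sample keeps the original length despite the three added sentinel tokens.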