bridge.data.datasets.fim_dataset#
Module Contents#
Classes#
FIM (Fill In The Middle) GPT Dataset
Data#
API#
- bridge.data.datasets.fim_dataset.logger#
'getLogger(...)'
- class bridge.data.datasets.fim_dataset.GPTFIMDataset(
- indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- dataset_path: str,
- indexed_indices: numpy.ndarray,
- num_samples: int,
- index_split: megatron.core.datasets.utils.Split,
- config: megatron.bridge.training.config.GPTFIMDatasetConfig,
- )
Bases:
megatron.core.datasets.gpt_dataset.GPTDataset

FIM (Fill In The Middle) GPT Dataset
- Parameters:
indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the
MegatronDataset
indexed_indices (np.ndarray) – The set of document indices to expose
num_samples (int) – The number of samples to draw from the indexed dataset
index_split (Split) – The indexed_indices Split
config (GPTFIMDatasetConfig) – The GPT-specific container for all config sourced parameters
Initialization
- _query_document_sample_shuffle_indices(
- idx: int,
- )
Get the text (token ids) and document ids for a given index
- Parameters:
idx (int) – The index into the dataset
- Returns:
The text ids and document ids
- Return type:
Tuple[np.ndarray, np.ndarray]
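The shuffle/sample/document lookup behind this method can be sketched in plain Python. This is a simplified, hypothetical illustration (the real indices also carry token offsets so a sample can start mid-document, and the data are NumPy arrays rather than lists):

```python
def query_sample(idx, shuffle_index, sample_index, document_index, docs):
    """Simplified sketch of the shuffle -> sample -> document lookup:
    the shuffle index randomizes sample order, the sample index marks
    which documents each sample spans, and the document index maps those
    positions to real document ids. Hypothetical helper, not the
    Megatron implementation."""
    i = shuffle_index[idx]                       # randomized sample order
    lo, hi = sample_index[i], sample_index[i + 1]
    doc_ids = [document_index[p] for p in range(lo, hi)]
    tokens = [t for d in doc_ids for t in docs[d]]
    return tokens, doc_ids
```

The real method returns the same pair of arrays (token ids, document ids) described above.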
- _fim_permute_sequence(sequence, rate)#
- _fim_split_and_permute_sequence(sequence)#
If self.fim_split_sample is not None, split the sequence on it and apply FIM to each fragment; otherwise apply FIM to the whole sequence.
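The split-then-permute step can be sketched as follows. The helper name, the `split_tok` argument, and the injected `permute` callable are illustrative stand-ins for the configured `fim_split_sample` token and `_fim_permute_sequence`:

```python
def split_and_permute(sequence, split_tok, permute, rate):
    """Sketch: if split_tok is set, break the token list at each
    occurrence, FIM-permute every fragment, and rejoin with the split
    token; otherwise permute the whole sequence. Placeholder logic,
    not the Megatron implementation."""
    if split_tok is None:
        return permute(sequence, rate)
    fragments, current = [], []
    for tok in sequence:
        if tok == split_tok:
            fragments.append(current)
            current = []
        else:
            current.append(tok)
    fragments.append(current)
    out = []
    for i, frag in enumerate(fragments):
        if i:
            out.append(split_tok)   # restore the delimiter between fragments
        out.extend(permute(frag, rate))
    return out
```

Splitting first keeps each FIM transformation local to one fragment, so prefix/suffix/middle spans never cross a fragment boundary.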
- _permute(
- sample,
- fim_rate,
- fim_spm_rate,
- tokenizer,
- truncate_or_pad=True,
- suffix_tok_id=None,
- prefix_tok_id=None,
- middle_tok_id=None,
- pad_tok_id=None,
- no_fim_prefix=None,
- )
Take in a sample (np array with shape (0, chunklength)) and perform a FIM transformation on it, maintaining the same sample length (if the transform creates a few extra tokens, drop them).
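The PSM/SPM reordering performed here can be sketched in plain Python. This is a simplified illustration on token lists (the real method operates on NumPy arrays, takes a tokenizer, and handles truncation/padding); the sentinel token ids and the helper name are placeholders:

```python
import random

def fim_permute(sample, fim_rate, fim_spm_rate,
                prefix_tok, middle_tok, suffix_tok, rng=None):
    """Sketch of a FIM transform: with probability fim_rate, split the
    token list into (prefix, middle, suffix) at two random boundaries
    and emit either PSM or SPM ordering with sentinel tokens; otherwise
    return the sample unchanged. Placeholder logic, not the Megatron
    implementation (which also truncates or pads back to the original
    length, since the sentinels add three tokens)."""
    rng = rng or random.Random()
    if rng.random() >= fim_rate:
        return list(sample)                     # no FIM applied
    # choose two cut points to form prefix / middle / suffix
    lo, hi = sorted(rng.sample(range(len(sample) + 1), 2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.random() < fim_spm_rate:
        # SPM ordering: sentinels, then suffix, prefix, middle
        return [prefix_tok, suffix_tok] + suffix + [middle_tok] + prefix + middle
    # PSM ordering: prefix, suffix, middle, each introduced by a sentinel
    return [prefix_tok] + prefix + [suffix_tok] + suffix + [middle_tok] + middle
```

Training on both orderings (mixed by `fim_spm_rate`) teaches the model to infill whether the suffix is presented before or after the prefix.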