core.datasets.masked_dataset#
Module Contents#
Classes#
MaskedWordPieceDatasetConfig – Configuration object for Megatron Core Masked WordPiece datasets
MaskedWordPieceDataset – The semi-abstract base class for masked WordPiece datasets
Data#
API#
- core.datasets.masked_dataset.logger#
'getLogger(…)'
- class core.datasets.masked_dataset.MaskedWordPieceDatasetConfig#
Bases:
megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig

Configuration object for Megatron Core Masked WordPiece datasets
- masking_probability: float#
None
The probability we mask a candidate N-gram
- short_sequence_probability: float#
None
The probability we return a sequence shorter than the target sequence length
- masking_max_ngram: int#
None
The maximum length N-gram to consider masking or permuting
- masking_do_full_word: bool#
None
Whether we mask the whole word or its component parts
- masking_do_permutation: bool#
None
Whether we shuffle a subset of candidate N-grams in addition to masking
- masking_use_longer_ngrams: bool#
None
Whether to favor longer N-grams over shorter N-grams
- masking_use_geometric_distribution: bool#
None
Whether to draw the size of the N-gram from a geometric distribution according to SpanBERT https://arxiv.org/abs/1907.10529 (Section 3.1)
- __post_init__() None#
Do asserts and set fields post init
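The `masking_use_geometric_distribution` option above follows SpanBERT (Section 3.1), which draws span lengths from a geometric distribution clipped to a maximum length. A minimal sketch of that sampling step, assuming the SpanBERT parameter p = 0.2 and a maximum N-gram length of 10 (both are illustrative defaults, not values taken from this config):

```python
import numpy as np

def sample_ngram_length(
    numpy_random_state: np.random.RandomState,
    masking_max_ngram: int = 10,  # illustrative maximum, not the config default
    p: float = 0.2,  # geometric parameter from SpanBERT (assumption)
) -> int:
    """Hypothetical helper: draw an N-gram length from a truncated
    geometric distribution, biasing selection toward shorter spans."""
    length = numpy_random_state.geometric(p)
    return min(length, masking_max_ngram)

rng = np.random.RandomState(1234)
lengths = [sample_ngram_length(rng) for _ in range(1000)]
# Every draw falls in [1, masking_max_ngram]; short lengths are most frequent.
```

The truncation means any mass beyond the maximum collapses onto the longest allowed N-gram, so very small maxima can actually favor the longest span; SpanBERT's choice of 10 keeps the short-span bias intact.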
- class core.datasets.masked_dataset.MaskedWordPieceDataset(
- indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- dataset_path: str,
- indexed_indices: numpy.ndarray,
- num_samples: Optional[int],
- index_split: megatron.core.datasets.utils.Split,
- config: core.datasets.masked_dataset.MaskedWordPieceDatasetConfig,
- )#
Bases:
megatron.core.datasets.megatron_dataset.MegatronDataset

The semi-abstract base class for masked WordPiece datasets
This implementation makes the rigid assumption that all inheritor datasets are built upon the IndexedDataset class. This assumption may be pushed down to the inheritors in the future if necessary.
NB: WordPiece tokenization prepends a double hash “##” to all tokens/pieces in a word, save the first token/piece.
- Parameters:
indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset
dataset_path (str) – The real path on disk to the dataset, for bookkeeping
indexed_indices (numpy.ndarray) – The set of the documents indices to expose
num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.
index_split (Split) – The indexed_indices Split
config (MaskedWordPieceDatasetConfig) – The config
Initialization
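The "##" convention noted above is what makes whole-word masking (`masking_do_full_word`) possible: a token beginning with "##" continues the preceding word, so contiguous pieces can be grouped and masked as one unit. An illustrative helper (not part of the Megatron API) showing that grouping:

```python
from typing import List

def group_wordpieces(tokens: List[str]) -> List[List[int]]:
    """Hypothetical helper: group WordPiece token indices into whole words.

    A token starting with "##" extends the previous word; any other token
    begins a new word. Whole-word masking treats each group as a single
    masking candidate rather than masking individual pieces.
    """
    groups: List[List[int]] = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# "unaffable" tokenizes to ["un", "##aff", "##able"] and groups as one word
print(group_wordpieces(["the", "un", "##aff", "##able", "dog"]))
# [[0], [1, 2, 3], [4]]
```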
- static numel_low_level_dataset(
- low_level_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- ) -> int#
Return the number of documents in the underlying low level dataset.
- Parameters:
low_level_dataset (IndexedDataset) – The underlying IndexedDataset
- Returns:
The number of unique elements in the underlying IndexedDataset
- Return type:
int
- static build_low_level_dataset(
- dataset_path: str,
- config: core.datasets.masked_dataset.MaskedWordPieceDatasetConfig,
- ) -> megatron.core.datasets.indexed_dataset.IndexedDataset#
Build the low level dataset (IndexedDataset) from the given path.
- Parameters:
dataset_path (str) – The real path prefix to the IndexedDataset .bin and .idx files
config (MaskedWordPieceDatasetConfig) – The config
- Returns:
The underlying IndexedDataset
- Return type:
megatron.core.datasets.indexed_dataset.IndexedDataset
- static _key_config_attributes() -> List[str]#
Inherited method implementation
- Returns:
The key config attributes
- Return type:
List[str]
- __len__() -> int#
- _build_sample_index(
- sequence_length: int,
- min_sentences_per_sample: int,
- )#
- _create_masked_lm_predictions(
- token_ids: List[int],
- target_sequence_length: int,
- numpy_random_state: numpy.random.RandomState,
- )#
Creates the predictions for the masked LM objective
- Parameters:
token_ids (List[int]) – The token ids
target_sequence_length (int) – The target sequence length
numpy_random_state (numpy.random.RandomState) – The NumPy random state
- Returns:
masked_token_ids -> The masked sequence
masked_positions -> The indices for the masked token ids
masked_labels -> The original token ids for the masked token ids
boundaries -> The sentence and word boundaries for the sequence
masked_spans -> The masked positions and labels with N-gram info intact
- Return type:
Tuple[List[int], List[int], List[int], List[int], List[Tuple[List[int], List[int]]]]
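A toy sketch of the first three return values (masked sequence, masked positions, original labels), assuming unigram-only masking with no permutation and a hypothetical mask token id of 103; this is an illustration of the return structure, not the Megatron implementation:

```python
from typing import List, Tuple
import numpy as np

def toy_masked_lm_predictions(
    token_ids: List[int],
    masking_probability: float,
    mask_token_id: int,  # hypothetical id, e.g. BERT's [MASK]
    numpy_random_state: np.random.RandomState,
) -> Tuple[List[int], List[int], List[int]]:
    """Toy sketch: independently mask each token with the given probability,
    recording where the masks went and what the original ids were."""
    masked_token_ids = list(token_ids)
    masked_positions: List[int] = []
    masked_labels: List[int] = []
    for position, token_id in enumerate(token_ids):
        if numpy_random_state.random_sample() < masking_probability:
            masked_positions.append(position)
            masked_labels.append(token_id)
            masked_token_ids[position] = mask_token_id
    return masked_token_ids, masked_positions, masked_labels

rng = np.random.RandomState(0)
ids, positions, labels = toy_masked_lm_predictions([5, 6, 7, 8], 0.6, 103, rng)
```

Zipping `positions` with `labels` recovers the original sequence from the masked one, which is exactly the bookkeeping the masked LM loss needs.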
- abstractmethod _get_token_mask(
- numpy_random_state: numpy.random.RandomState,
- )#