core.datasets.masked_dataset#

Module Contents#

Classes#

MaskedWordPieceDatasetConfig

Configuration object for Megatron Core Masked WordPiece datasets

MaskedWordPieceDataset

The semi-abstract base class for masked WordPiece datasets

Data#

API#

core.datasets.masked_dataset.logger#

'getLogger(…)'

class core.datasets.masked_dataset.MaskedWordPieceDatasetConfig#

Bases: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig

Configuration object for Megatron Core Masked WordPiece datasets

masking_probability: float#

None

The probability we mask a candidate N-gram

short_sequence_probability: float#

None

The probability we return a sequence shorter than the target sequence length

masking_max_ngram: int#

None

The maximum length N-gram to consider masking or permuting

masking_do_full_word: bool#

None

Whether we mask the whole word or its component parts

masking_do_permutation: bool#

None

Whether, in addition to masking, we permute a subset of candidate N-grams

masking_use_longer_ngrams: bool#

None

Whether to favor longer N-grams over shorter N-grams

masking_use_geometric_distribution: bool#

None

Whether to draw the size of the N-gram from a geometric distribution according to SpanBERT https://arxiv.org/abs/1907.10529 (Section 3.1)
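For intuition, a minimal sketch of the SpanBERT-style draw. This is an illustration, not the module's implementation; p=0.2 is SpanBERT's reported parameter, and truncating at masking_max_ngram is an assumption:

```python
import numpy

def draw_ngram_length(
    numpy_random_state: numpy.random.RandomState,
    masking_max_ngram: int,
    p: float = 0.2,  # SpanBERT's reported parameter (assumption here)
) -> int:
    # Draw from a geometric distribution, then truncate so the span
    # never exceeds the configured maximum N-gram length.
    return int(min(numpy_random_state.geometric(p), masking_max_ngram))
```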

__post_init__() → None#

Run assertions and set derived fields after initialization
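A minimal construction sketch. The masking_* fields are documented above; random_seed, sequence_length, and tokenizer are assumed to be inherited from BlendedMegatronDatasetConfig (consult its documentation for the full required field set):

```python
from megatron.core.datasets.masked_dataset import MaskedWordPieceDatasetConfig

config = MaskedWordPieceDatasetConfig(
    random_seed=1234,        # assumed parent field
    sequence_length=512,     # assumed parent field
    tokenizer=tokenizer,     # a WordPiece tokenizer instance (assumed parent field)
    masking_probability=0.15,
    short_sequence_probability=0.1,
    masking_max_ngram=3,
    masking_do_full_word=True,
    masking_do_permutation=False,
    masking_use_longer_ngrams=False,
    masking_use_geometric_distribution=False,
)
```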

class core.datasets.masked_dataset.MaskedWordPieceDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: Optional[int],
index_split: megatron.core.datasets.utils.Split,
config: core.datasets.masked_dataset.MaskedWordPieceDatasetConfig,
)#

Bases: megatron.core.datasets.megatron_dataset.MegatronDataset

The semi-abstract base class for masked WordPiece datasets

This implementation makes the rigid assumption that all inheritor datasets are built upon the IndexedDataset class. This assumption may be pushed down to the inheritors in the future if necessary.

NB: WordPiece tokenization prepends a double hash "##" to all tokens/pieces in a word, save the first token/piece.
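To make the "##" convention concrete, a small illustration (not the module's code) of grouping pieces back into whole words, which is what masking_do_full_word relies on:

```python
# WordPiece splits a word into pieces; every piece after the first is
# prefixed with "##". Grouping on that prefix recovers word boundaries.
pieces = ["the", "un", "##aff", "##able", "cat"]

words: list[list[str]] = []
for piece in pieces:
    if piece.startswith("##") and words:
        words[-1].append(piece)  # continuation of the current word
    else:
        words.append([piece])    # start of a new word

assert words == [["the"], ["un", "##aff", "##able"], ["cat"]]
```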

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset

  • dataset_path (str) – The real path on disk to the dataset, for bookkeeping

  • indexed_indices (numpy.ndarray) – The set of document indices to expose

  • num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

  • index_split (Split) – The indexed_indices Split

  • config (MaskedWordPieceDatasetConfig) – The config

Initialization
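Because the class is semi-abstract, it is typically not instantiated directly. A sketch of the usual construction path via the dataset builder (the subclass name and the builder arguments are assumptions; verify against BlendedMegatronDatasetBuilder):

```python
from megatron.core.datasets.blended_megatron_dataset_builder import (
    BlendedMegatronDatasetBuilder,
)
from megatron.core.datasets.bert_dataset import (
    BERTMaskedWordPieceDataset,  # assumed concrete subclass
)

# The builder supplies indexed_dataset, indexed_indices, num_samples, and
# index_split for each of the train/valid/test splits.
train_ds, valid_ds, test_ds = BlendedMegatronDatasetBuilder(
    BERTMaskedWordPieceDataset,
    [10000, 1000, 1000],   # samples per split
    lambda: True,          # whether to build on this rank
    config,                # a config compatible with the subclass
).build()
```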

static numel_low_level_dataset(
low_level_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
) → int#

Return the number of documents in the underlying low level dataset.

Parameters:

low_level_dataset (IndexedDataset) – The underlying IndexedDataset

Returns:

The number of unique elements in the underlying IndexedDataset

Return type:

int

static build_low_level_dataset(
dataset_path: str,
config: core.datasets.masked_dataset.MaskedWordPieceDatasetConfig,
) → megatron.core.datasets.indexed_dataset.IndexedDataset#

Build the low level dataset (IndexedDataset) from the given path.

Parameters:
  • dataset_path (str) – The real path prefix to the IndexedDataset .bin and .idx files

  • config (MaskedWordPieceDatasetConfig) – The config

Returns:

The underlying IndexedDataset

Return type:

IndexedDataset
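A brief usage sketch combining the two static helpers above (the path prefix is illustrative):

```python
# Open the low level dataset, then count its documents.
indexed = MaskedWordPieceDataset.build_low_level_dataset(
    "/data/corpus_text_sentence",  # prefix for the .bin/.idx pair (illustrative)
    config,
)
num_documents = MaskedWordPieceDataset.numel_low_level_dataset(indexed)
```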

static _key_config_attributes() → List[str]#

Inherited method implementation

Returns:

The key config attributes

Return type:

List[str]

__len__() → int#

Return the number of samples in the dataset.

_build_sample_index(
sequence_length: int,
min_sentences_per_sample: int,
) → numpy.ndarray#

_create_masked_lm_predictions(
token_ids: List[int],
target_sequence_length: int,
numpy_random_state: numpy.random.RandomState,
) → Tuple[List[int], List[int], List[int], List[int], List[Tuple[List[int], List[int]]]]#

Create the predictions for the masked language modeling objective

Parameters:
  • token_ids (List[int]) – The token ids

  • target_sequence_length (int) – The target sequence length

  • numpy_random_state (numpy.random.RandomState) – The NumPy random state

Returns:

  1. masked_token_ids -> The masked sequence

  2. masked_positions -> The indices for the masked token ids

  3. masked_labels -> The original token ids for the masked token ids

  4. boundaries -> The sentence and word boundaries for the sequence

  5. masked_spans -> The masked positions and labels with N-gram info intact

Return type:

Tuple[List[int], List[int], List[int], List[int], List[Tuple[List[int], List[int]]]]
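A sketch of how the five-element return value relates back to the input. This is illustrative only; the method is private and invoked internally during sample construction:

```python
import numpy

numpy_random_state = numpy.random.RandomState(1234)

(
    masked_token_ids,   # the sequence with masks applied
    masked_positions,   # indices of the masked tokens
    masked_labels,      # original ids at those indices
    boundaries,         # sentence and word boundary markers
    masked_spans,       # (positions, labels) per masked N-gram
) = dataset._create_masked_lm_predictions(
    token_ids, target_sequence_length, numpy_random_state
)

# The labels recover the original token at every masked position.
for position, label in zip(masked_positions, masked_labels):
    assert token_ids[position] == label
```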

abstractmethod _get_token_mask(
numpy_random_state: numpy.random.RandomState,
) → Optional[int]#
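The hook each inheritor must provide. A sketch of one plausible BERT-style implementation; the 80/10/10 split and the tokenizer attribute names are assumptions, not the library's code:

```python
import numpy
from typing import Optional

def _get_token_mask(
    self, numpy_random_state: numpy.random.RandomState
) -> Optional[int]:
    prob = numpy_random_state.random_sample()
    if prob < 0.8:
        # Replace with the mask token; attribute name is assumed.
        return self.config.tokenizer.mask
    if prob < 0.9:
        # Replace with a random token id from the vocabulary; the
        # vocab_size attribute is assumed.
        return int(numpy_random_state.randint(0, self.config.tokenizer.vocab_size))
    return None  # keep the original token
```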