core.datasets.masked_dataset#

Module Contents#

Classes#

MaskedWordPieceDatasetConfig

Configuration object for Megatron Core Masked WordPiece datasets

MaskedWordPieceDataset

The semi-abstract base class for masked WordPiece datasets

Data#

API#

core.datasets.masked_dataset.logger#

'getLogger(…)'

class core.datasets.masked_dataset.MaskedWordPieceDatasetConfig#

Bases: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig

Configuration object for Megatron Core Masked WordPiece datasets

masking_probability: float#

None

The probability we mask a candidate N-gram

short_sequence_probability: float#

None

The probability we return a sequence shorter than the target sequence length

masking_max_ngram: int#

None

The maximum length N-gram to consider masking or permuting

masking_do_full_word: bool#

None

Whether we mask the whole word or its component parts

masking_do_permutation: bool#

None

Whether, in addition to masking, we permute a subset of candidate N-grams

masking_use_longer_ngrams: bool#

None

Whether to favor longer N-grams over shorter N-grams

masking_use_geometric_distribution: bool#

None

Whether to draw the size of the N-gram from a geometric distribution according to SpanBERT https://arxiv.org/abs/1907.10529 (Section 3.1)
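For intuition, a minimal sketch of the SpanBERT-style draw. This is an illustration, not the module's implementation; p=0.2 is SpanBERT's reported parameter, and truncating at masking_max_ngram is an assumption:

```python
import numpy

def draw_ngram_length(
    numpy_random_state: numpy.random.RandomState,
    masking_max_ngram: int,
    p: float = 0.2,  # SpanBERT's reported parameter (assumption here)
) -> int:
    # Draw from a geometric distribution, then truncate so the span
    # never exceeds the configured maximum N-gram length.
    return int(min(numpy_random_state.geometric(p), masking_max_ngram))
```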

__post_init__() → None#

Run assertions and set derived fields after initialization
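A minimal construction sketch. The masking_* fields are documented above; random_seed, sequence_length, and tokenizer are assumed to be inherited from BlendedMegatronDatasetConfig (consult its documentation for the full required field set):

```python
from megatron.core.datasets.masked_dataset import MaskedWordPieceDatasetConfig

config = MaskedWordPieceDatasetConfig(
    random_seed=1234,        # assumed parent field
    sequence_length=512,     # assumed parent field
    tokenizer=tokenizer,     # a WordPiece tokenizer instance (assumed parent field)
    masking_probability=0.15,
    short_sequence_probability=0.1,
    masking_max_ngram=3,
    masking_do_full_word=True,
    masking_do_permutation=False,
    masking_use_longer_ngrams=False,
    masking_use_geometric_distribution=False,
)
```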

class core.datasets.masked_dataset.MaskedWordPieceDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: Optional[int],
index_split: megatron.core.datasets.utils.Split,
config: core.datasets.masked_dataset.MaskedWordPieceDatasetConfig,
)#

Bases: megatron.core.datasets.megatron_dataset.MegatronDataset

The semi-abstract base class for masked WordPiece datasets

This implementation makes the rigid assumption that all inheritor datasets are built upon the IndexedDataset class. This assumption may be pushed down to the inheritors in the future if necessary.

NB: WordPiece tokenization prepends a double hash "##" to all tokens/pieces in a word, save the first token/piece.
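To make the "##" convention concrete, a small illustration (not the module's code) of grouping pieces back into whole words, which is what masking_do_full_word relies on:

```python
# WordPiece splits a word into pieces; every piece after the first is
# prefixed with "##". Grouping on that prefix recovers word boundaries.
pieces = ["the", "un", "##aff", "##able", "cat"]

words: list[list[str]] = []
for piece in pieces:
    if piece.startswith("##") and words:
        words[-1].append(piece)  # continuation of the current word
    else:
        words.append([piece])    # start of a new word

assert words == [["the"], ["un", "##aff", "##able"], ["cat"]]
```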

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset

  • dataset_path (str) – The real path on disk to the dataset, for bookkeeping

  • indexed_indices (numpy.ndarray) – The set of document indices to expose

  • num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

  • index_split (Split) – The indexed_indices Split

  • config (MaskedWordPieceDatasetConfig) – The config

Initialization
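Because the class is semi-abstract, it is typically not instantiated directly. A sketch of the usual construction path via the dataset builder (the subclass name and the builder arguments are assumptions; verify against BlendedMegatronDatasetBuilder):

```python
from megatron.core.datasets.blended_megatron_dataset_builder import (
    BlendedMegatronDatasetBuilder,
)
from megatron.core.datasets.bert_dataset import (
    BERTMaskedWordPieceDataset,  # assumed concrete subclass
)

# The builder supplies indexed_dataset, indexed_indices, num_samples, and
# index_split for each of the train/valid/test splits.
train_ds, valid_ds, test_ds = BlendedMegatronDatasetBuilder(
    BERTMaskedWordPieceDataset,
    [10000, 1000, 1000],   # samples per split
    lambda: True,          # whether to build on this rank
    config,                # a config compatible with the subclass
).build()
```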

static numel_low_level_dataset(
low_level_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
) → int#

Return the number of documents in the underlying low level dataset.

Parameters:

low_level_dataset (IndexedDataset) – The underlying IndexedDataset

Returns:

The number of unique elements in the underlying IndexedDataset

Return type:

int

static build_low_level_dataset(
dataset_path: str,
config: core.datasets.masked_dataset.MaskedWordPieceDatasetConfig,
) → megatron.core.datasets.indexed_dataset.IndexedDataset#

Build the low level dataset (IndexedDataset) from the given path.

Parameters:
  • dataset_path (str) – The real path prefix to the IndexedDataset .bin and .idx files

  • config (MaskedWordPieceDatasetConfig) – The config

Returns:

The underlying IndexedDataset

Return type:

IndexedDataset
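A brief usage sketch combining the two static helpers above (the path prefix is illustrative):

```python
# Open the low level dataset, then count its documents.
indexed = MaskedWordPieceDataset.build_low_level_dataset(
    "/data/corpus_text_sentence",  # prefix for the .bin/.idx pair (illustrative)
    config,
)
num_documents = MaskedWordPieceDataset.numel_low_level_dataset(indexed)
```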

static _key_config_attributes() → List[str]#

Inherited method implementation

Returns:

The key config attributes

Return type:

List[str]

__len__() → int#

Return the number of samples in the dataset.

_build_sample_index(
sequence_length: int,
min_sentences_per_sample: int,
) → numpy.ndarray#

_create_masked_lm_predictions(
token_ids: List[int],
target_sequence_length: int,
numpy_random_state: numpy.random.RandomState,
) → Tuple[List[int], List[int], List[int], List[int], List[Tuple[List[int], List[int]]]]#

Create the predictions for the masked language modeling objective

Parameters:
  • token_ids (List[int]) – The token ids

  • target_sequence_length (int) – The target sequence length

  • numpy_random_state (numpy.random.RandomState) – The NumPy random state

Returns:

  1. masked_token_ids -> The masked sequence

  2. masked_positions -> The indices for the masked token ids

  3. masked_labels -> The original token ids for the masked token ids

  4. boundaries -> The sentence and word boundaries for the sequence

  5. masked_spans -> The masked positions and labels with N-gram info intact

Return type:

Tuple[List[int], List[int], List[int], List[int], List[Tuple[List[int], List[int]]]]
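A sketch of how the five-element return value relates back to the input. This is illustrative only; the method is private and invoked internally during sample construction:

```python
import numpy

numpy_random_state = numpy.random.RandomState(1234)

(
    masked_token_ids,   # the sequence with masks applied
    masked_positions,   # indices of the masked tokens
    masked_labels,      # original ids at those indices
    boundaries,         # sentence and word boundary markers
    masked_spans,       # (positions, labels) per masked N-gram
) = dataset._create_masked_lm_predictions(
    token_ids, target_sequence_length, numpy_random_state
)

# The labels recover the original token at every masked position.
for position, label in zip(masked_positions, masked_labels):
    assert token_ids[position] == label
```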

abstractmethod _get_token_mask(
numpy_random_state: numpy.random.RandomState,
) → Optional[int]#
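The hook each inheritor must provide. A sketch of one plausible BERT-style implementation; the 80/10/10 split and the tokenizer attribute names are assumptions, not the library's code:

```python
import numpy
from typing import Optional

def _get_token_mask(
    self, numpy_random_state: numpy.random.RandomState
) -> Optional[int]:
    prob = numpy_random_state.random_sample()
    if prob < 0.8:
        # Replace with the mask token; attribute name is assumed.
        return self.config.tokenizer.mask
    if prob < 0.9:
        # Replace with a random token id from the vocabulary; the
        # vocab_size attribute is assumed.
        return int(numpy_random_state.randint(0, self.config.tokenizer.vocab_size))
    return None  # keep the original token
```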