core.datasets.bert_dataset

Module Contents

Classes

BERTMaskedWordPieceDatasetConfig

Configuration object for Megatron Core BERT WordPiece datasets

BERTMaskedWordPieceDataset

The BERT dataset that assumes WordPiece tokenization

API

class core.datasets.bert_dataset.BERTMaskedWordPieceDatasetConfig

Bases: megatron.core.datasets.masked_dataset.MaskedWordPieceDatasetConfig

Configuration object for Megatron Core BERT WordPiece datasets

classification_head: bool = None

Whether to perform next sequence prediction during sampling

__post_init__() -> None

Run assertions and set derived fields after initialization

class core.datasets.bert_dataset.BERTMaskedWordPieceDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: Optional[int],
index_split: megatron.core.datasets.utils.Split,
config: core.datasets.bert_dataset.BERTMaskedWordPieceDatasetConfig,
)

Bases: megatron.core.datasets.masked_dataset.MaskedWordPieceDataset

The BERT dataset that assumes WordPiece tokenization

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset

  • dataset_path (str) – The real path on disk to the dataset, for bookkeeping

  • indexed_indices (numpy.ndarray) – The set of document indices to expose

  • num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

  • index_split (Split) – The Split to which indexed_indices belongs

  • config (BERTMaskedWordPieceDatasetConfig) – The config

Initialization

static _key_config_attributes() -> List[str]

Inherited method implementation

Returns:

The key config attributes

Return type:

List[str]

__getitem__(
idx: int,
) -> Dict[str, Union[int, numpy.ndarray]]

Abstract method implementation

Parameters:

idx (int) – The index into the dataset

Returns:

The sample information wrapped in a dictionary

Return type:

Dict[str, Union[int, numpy.ndarray]]

_get_token_mask(
numpy_random_state: numpy.random.RandomState,
) -> Optional[int]

Abstract method implementation

80% of the time, replace the token id with mask token id. 10% of the time, replace token id with a random token id from the vocabulary. 10% of the time, do nothing.

Parameters:

numpy_random_state (RandomState) – The NumPy random state

Returns:

The replacement token id, or None to leave the token unchanged

Return type:

Optional[int]
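
The 80/10/10 rule described for _get_token_mask can be illustrated with a minimal standalone sketch. This is not the Megatron Core implementation: the function name get_token_mask_sketch and its mask_token_id and vocab_ids parameters are hypothetical, introduced only so the example runs on its own, and the way the real method obtains the mask id and vocabulary from its tokenizer is not shown here.

```python
from typing import Optional, Sequence

import numpy


def get_token_mask_sketch(
    numpy_random_state: numpy.random.RandomState,
    mask_token_id: int,        # hypothetical: id of the mask token
    vocab_ids: Sequence[int],  # hypothetical: candidate ids for random replacement
) -> Optional[int]:
    """Return a replacement token id, or None to keep the original token."""
    # One uniform draw in [0, 1) decides which of the three branches applies.
    draw = numpy_random_state.random_sample()
    if draw < 0.8:
        # 80% of the time: replace with the mask token id.
        return mask_token_id
    if draw < 0.9:
        # 10% of the time: replace with a random token id from the vocabulary.
        index = numpy_random_state.randint(0, len(vocab_ids))
        return int(vocab_ids[index])
    # Remaining 10% of the time: do nothing.
    return None


# Usage: seed the random state so the sequence of decisions is reproducible.
rng = numpy.random.RandomState(1234)
replacement = get_token_mask_sketch(rng, mask_token_id=103, vocab_ids=range(1000, 1100))
```

Passing the RandomState in, rather than using a global generator, means the masking decisions depend only on the state supplied by the caller.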