`core.datasets.bert_dataset`

Module Contents

Classes

`BERTMaskedWordPieceDatasetConfig`	Configuration object for Megatron Core BERT WordPiece datasets
`BERTMaskedWordPieceDataset`	The BERT dataset that assumes WordPiece tokenization

API

class core.datasets.bert_dataset.BERTMaskedWordPieceDatasetConfig

Bases: megatron.core.datasets.masked_dataset.MaskedWordPieceDatasetConfig

Configuration object for Megatron Core BERT WordPiece datasets

classification_head: bool = None

Option to perform the next sequence prediction during sampling

__post_init__() → None

Run assertions and set fields after initialization

class core.datasets.bert_dataset.BERTMaskedWordPieceDataset(indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset, dataset_path: str, indexed_indices: numpy.ndarray, num_samples: Optional[int], index_split: megatron.core.datasets.utils.Split, config: core.datasets.bert_dataset.BERTMaskedWordPieceDatasetConfig)

Bases: megatron.core.datasets.masked_dataset.MaskedWordPieceDataset

The BERT dataset that assumes WordPiece tokenization

Parameters:

indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset
dataset_path (str) – The real path on disk to the dataset, for bookkeeping
indexed_indices (numpy.ndarray) – The set of document indices to expose
num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.
index_split (Split) – The Split that indexed_indices belongs to
config (BERTMaskedWordPieceDatasetConfig) – The config

static _key_config_attributes() → List[str]

Inherited method implementation

Returns: The key config attributes
Return type: List[str]

__getitem__(idx: int) → Dict[str, Union[int, numpy.ndarray]]

Abstract method implementation

Parameters: idx (int) – The index into the dataset
Returns: The sample information wrapped in a dictionary
Return type: Dict[str, Union[int, numpy.ndarray]]

_get_token_mask(numpy_random_state: numpy.random.RandomState) → Optional[int]

Abstract method implementation

80% of the time, replace the token id with the mask token id. 10% of the time, replace the token id with a random token id from the vocabulary. 10% of the time, leave the token unchanged.

Parameters: numpy_random_state (RandomState) – The NumPy random state
Returns: The replacement token id, or None when the token is left unchanged
Return type: Optional[int]
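
The 80/10/10 rule described for _get_token_mask can be illustrated with a minimal sketch. This is not the Megatron Core implementation: the function name get_token_mask_sketch and the parameters mask_id and vocab_ids are hypothetical stand-ins for whatever the dataset reads from its tokenizer, and the way the random draw is split into ranges is an assumption made for illustration.

```python
from typing import Optional, Sequence

import numpy


def get_token_mask_sketch(
    numpy_random_state: numpy.random.RandomState,
    mask_id: int,                # hypothetical: id of the mask token
    vocab_ids: Sequence[int],    # hypothetical: candidate ids for random replacement
) -> Optional[int]:
    """Return a replacement token id, or None to leave the token unchanged."""
    # One uniform draw in [0, 1) selects which of the three branches applies.
    draw = numpy_random_state.random_sample()
    if draw < 0.8:
        # About 80% of the time: replace with the mask token id.
        return mask_id
    if draw < 0.9:
        # About 10% of the time: replace with a random id from the vocabulary.
        position = numpy_random_state.randint(0, len(vocab_ids))
        return int(vocab_ids[position])
    # Remaining ~10% of the time: do nothing.
    return None


if __name__ == "__main__":
    rng = numpy.random.RandomState(1234)
    # Each call returns either an int (replacement id) or None (keep the token).
    print([get_token_mask_sketch(rng, mask_id=103, vocab_ids=[5, 6, 7]) for _ in range(5)])
```

Passing the RandomState in explicitly, as the documented signature does, keeps the masking decision reproducible for a given seed.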
