core.datasets.bert_dataset
Module Contents
Classes
BERTMaskedWordPieceDatasetConfig | Configuration object for Megatron Core BERT WordPiece datasets
BERTMaskedWordPieceDataset | The BERT dataset that assumes WordPiece tokenization
API
- class core.datasets.bert_dataset.BERTMaskedWordPieceDatasetConfig
Bases:
megatron.core.datasets.masked_dataset.MaskedWordPieceDatasetConfig
Configuration object for Megatron Core BERT WordPiece datasets
- classification_head: bool = None
Option to perform the next sequence prediction during sampling
- __post_init__() -> None
Do asserts and set fields post init
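The post-init pattern described above can be sketched with a plain dataclass. This is only an illustration: `ToyMaskedDatasetConfig` is a hypothetical stand-in, not the Megatron Core class, and the specific assertion is an assumption about what "do asserts" might mean for a field whose default is None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMaskedDatasetConfig:
    """Hypothetical stand-in for a config with a post-init check."""

    # Mirrors the documented field name; the default of None signals "not set".
    classification_head: Optional[bool] = None

    def __post_init__(self) -> None:
        # Assumed check: require the caller to make an explicit choice.
        assert self.classification_head is not None, "classification_head must be set"


config = ToyMaskedDatasetConfig(classification_head=True)
```

Constructing the toy config without `classification_head` raises an `AssertionError`, so a missing option is reported at construction time rather than during sampling.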
- class core.datasets.bert_dataset.BERTMaskedWordPieceDataset(
- indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- dataset_path: str,
- indexed_indices: numpy.ndarray,
- num_samples: Optional[int],
- index_split: megatron.core.datasets.utils.Split,
- config: core.datasets.bert_dataset.BERTMaskedWordPieceDatasetConfig,
)
Bases:
megatron.core.datasets.masked_dataset.MaskedWordPieceDataset
The BERT dataset that assumes WordPiece tokenization
- Parameters:
indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset
dataset_path (str) – The real path on disk to the dataset, for bookkeeping
indexed_indices (numpy.ndarray) – The set of document indices to expose
num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.
index_split (Split) – The Split to which indexed_indices belongs
config (BERTMaskedWordPieceDatasetConfig) – The config
Initialization
- static _key_config_attributes() -> List[str]
Inherited method implementation
- Returns:
The key config attributes
- Return type:
List[str]
- __getitem__(
- idx: int,
) -> Dict[str, Union[int, numpy.ndarray]]
Abstract method implementation
- Parameters:
idx (int) – The index into the dataset
- Returns:
The sample information wrapped in a dictionary
- Return type:
Dict[str, Union[int, numpy.ndarray]]
- _get_token_mask(
- numpy_random_state: numpy.random.RandomState,
) -> Optional[int]
Abstract method implementation
80% of the time, replace the token id with the mask token id. 10% of the time, replace the token id with a random token id from the vocabulary. 10% of the time, leave the token id unchanged (return None).
- Parameters:
numpy_random_state (RandomState) – The NumPy random state
- Returns:
The replacement token id or None
- Return type:
Optional[int]
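The 80/10/10 rule described for `_get_token_mask` can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `sketch_get_token_mask` and the explicit `mask_token_id` and `vocab_size` parameters are hypothetical (the real method presumably obtains these from the dataset's tokenizer).

```python
from typing import Optional

import numpy


def sketch_get_token_mask(
    numpy_random_state: numpy.random.RandomState,
    mask_token_id: int,  # hypothetical parameter, assumed to come from the tokenizer
    vocab_size: int,  # hypothetical parameter, assumed to come from the tokenizer
) -> Optional[int]:
    """Return a replacement token id, or None to leave the token unchanged."""
    draw = numpy_random_state.random_sample()  # uniform in [0, 1)
    if draw < 0.8:
        # 80% of the time: replace with the mask token id.
        return mask_token_id
    if draw < 0.9:
        # 10% of the time: replace with a random token id from the vocabulary.
        return int(numpy_random_state.randint(0, vocab_size))
    # Remaining 10% of the time: do nothing.
    return None


# Example call with illustrative values for the mask id and vocabulary size.
random_state = numpy.random.RandomState(1234)
replacement = sketch_get_token_mask(random_state, mask_token_id=103, vocab_size=30522)
```

Passing the `RandomState` in explicitly, as the documented signature does, keeps masking reproducible for a given seed.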