core.datasets.t5_dataset#

Module Contents#

Classes#

T5MaskedWordPieceDatasetConfig

Configuration object for Megatron Core T5 WordPiece datasets

T5MaskedWordPieceDataset

The T5 dataset that assumes WordPiece tokenization

API#

class core.datasets.t5_dataset.T5MaskedWordPieceDatasetConfig#

Bases: megatron.core.datasets.masked_dataset.MaskedWordPieceDatasetConfig

Configuration object for Megatron Core T5 WordPiece datasets

NB: This is a temporary holdover from Megatron-LM. The T5 tokenizer has an attribute that defines the number of special sentinel tokens used during sampling. The assert in __post_init__ preserves compatibility with Megatron-LM until the T5 tokenizer is ported to Megatron Core.
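The sentinel mechanism mentioned above can be illustrated with a minimal sketch: each masked span in the encoder input is replaced by a unique sentinel id, and the decoder target pairs each sentinel with the tokens it replaced. The function name, span format, and sentinel ids below are illustrative assumptions, not the Megatron Core implementation.

```python
# Hypothetical sketch of T5-style sentinel masking (not the actual
# Megatron Core code); spans are (start, end) index pairs into tokens.
def mask_with_sentinels(tokens, spans, sentinel_ids):
    """Replace each masked span with a unique sentinel id, T5-style.

    Returns the masked encoder input and the decoder target, which pairs
    each sentinel with the tokens it replaced.
    """
    masked, target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinel_ids, spans):
        masked.extend(tokens[cursor:start])  # keep tokens before the span
        masked.append(sentinel)              # span collapses to one sentinel
        target.append(sentinel)              # target announces the sentinel...
        target.extend(tokens[start:end])     # ...followed by the hidden tokens
        cursor = end
    masked.extend(tokens[cursor:])
    return masked, target
```

For example, masking spans (1, 3) and (4, 5) of [10, 11, 12, 13, 14, 15] with sentinels 900 and 901 yields the encoder input [10, 900, 13, 901, 15] and the target [900, 11, 12, 901, 14].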

sequence_length_encoder: Optional[int]#

‘field(…)’

An alias for sequence_length, used as the sequence length for the encoder

sequence_length_decoder: int#

None

The sequence length for the decoder

__post_init__() None#

Perform validation assertions and set derived fields after initialization
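A minimal sketch of how a config like this can validate and derive fields in __post_init__, mirroring the documented alias between sequence_length_encoder and sequence_length. The class name and the particular assert are placeholders, not the actual sentinel-token compatibility check.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for T5MaskedWordPieceDatasetConfig; the assert below is
# a placeholder, not the real tokenizer-compatibility assertion.
@dataclass
class T5DatasetConfigSketch:
    sequence_length: int
    sequence_length_decoder: int
    sequence_length_encoder: Optional[int] = None

    def __post_init__(self):
        # sequence_length_encoder is documented as an alias for sequence_length
        if self.sequence_length_encoder is None:
            self.sequence_length_encoder = self.sequence_length
        assert self.sequence_length_decoder > 0
```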

class core.datasets.t5_dataset.T5MaskedWordPieceDataset(
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
dataset_path: str,
indexed_indices: numpy.ndarray,
num_samples: Optional[int],
index_split: megatron.core.datasets.utils.Split,
config: core.datasets.t5_dataset.T5MaskedWordPieceDatasetConfig,
)#

Bases: megatron.core.datasets.masked_dataset.MaskedWordPieceDataset

The T5 dataset that assumes WordPiece tokenization

Parameters:
  • indexed_dataset (IndexedDataset) – The IndexedDataset around which to build the MegatronDataset

  • dataset_path (str) – The real path on disk to the dataset, for bookkeeping

  • indexed_indices (numpy.ndarray) – The set of document indices to expose

  • num_samples (Optional[int]) – The number of samples to draw from the indexed dataset. When None, build as many samples as correspond to one epoch.

  • index_split (Split) – The indexed_indices Split

  • config (T5MaskedWordPieceDatasetConfig) – The config

Initialization

static _key_config_attributes() List[str]#

Inherited method implementation

Returns:

The key config attributes

Return type:

List[str]

static _build_b1ss_attention_mask(
source_block: torch.tensor,
target_block: torch.tensor,
make_history_mask: bool = False,
) torch.tensor#

Build an attention mask of shape (bs, 1, q_len, kv_len) from source_block and target_block

Parameters:
  • source_block (torch.tensor) – A 2-D array of tokens (bs, q_len)

  • target_block (torch.tensor) – A 2-D array of tokens (bs, kv_len)

  • make_history_mask (bool) – Whether to turn mask into causal mask

Returns:

The 4-D attention mask (bs, 1, q_len, kv_len)

Return type:

torch.tensor
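The shape manipulation behind a (bs, 1, q_len, kv_len) mask can be sketched with NumPy (the actual method operates on torch tensors); treating token id 0 as padding is an assumption for the sketch.

```python
import numpy as np

def build_b1ss_mask(source_block, target_block, make_history_mask=False):
    """Sketch of building a (bs, 1, q_len, kv_len) mask; pad id 0 assumed."""
    # (bs, q_len, 1) & (bs, 1, kv_len) -> (bs, q_len, kv_len): a position is
    # attendable only where both query and key tokens are non-padding
    mask = (source_block[:, :, None] != 0) & (target_block[:, None, :] != 0)
    if make_history_mask:
        # causal constraint: each query position may only attend to key
        # positions at or before it
        q_len, kv_len = mask.shape[1], mask.shape[2]
        mask = mask & np.tril(np.ones((q_len, kv_len), dtype=bool))
    return mask[:, None, :, :]  # insert the singleton head dimension
```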

static config_attention_mask(
encoder_tokens: torch.tensor,
decoder_tokens: torch.tensor,
encoder_mask: torch.tensor,
decoder_mask: torch.tensor,
use_local: bool = False,
test_te_version: str = None,
) torch.tensor#

Configure the attention masks (encoder_mask, decoder_mask, encoder_decoder_mask) conditioned on the transformer implementation (e.g. TE vs. local), the TE version, and the TE backend

Parameters:
  • encoder_tokens (torch.tensor) – A 2-D array of tokens (bs, kv_len)

  • decoder_tokens (torch.tensor) – A 2-D array of tokens (bs, q_len)

  • encoder_mask (torch.tensor) – A 2-D array of tokens (bs, kv_len)

  • decoder_mask (torch.tensor) – A 2-D array of tokens (bs, q_len)

  • use_local (bool) – Whether the current T5 model uses the local (vs. TE) transformer implementation

  • test_te_version (str) – The Transformer Engine version to test against. Defaults to None.

Returns:

  • torch.tensor – The configured encoder attention mask

  • torch.tensor – The configured decoder attention mask

  • torch.tensor – The configured encoder-decoder attention mask
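The implementation-dependent branching can be illustrated with a hedged NumPy sketch. The cross-mask construction combines decoder queries with encoder keys; the mask inversion for the non-local branch assumes a TE-style "True means masked out" convention and stands in for the real version-dependent logic, which this sketch does not reproduce.

```python
import numpy as np

# Illustrative sketch only: not the actual config_attention_mask logic.
def configure_masks(encoder_mask, decoder_mask, use_local=False):
    # cross mask: decoder queries (bs, q_len) vs. encoder keys (bs, kv_len)
    encoder_decoder_mask = decoder_mask[:, :, None] & encoder_mask[:, None, :]
    if use_local:
        # local implementation: boolean "True means attend" masks pass through
        return encoder_mask, decoder_mask, encoder_decoder_mask
    # assumed TE-style convention: True marks positions to exclude
    return ~encoder_mask, ~decoder_mask, ~encoder_decoder_mask
```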

__getitem__(
idx: int,
) Dict[str, Union[int, numpy.ndarray]]#

Abstract method implementation

Parameters:

idx (int) – The index into the dataset

Returns:

The sample data including encoder input, decoder input/output, and masks.

Return type:

Dict[str, Union[int, numpy.ndarray]]

_get_token_mask(numpy_random_state: numpy.random.RandomState) int#

Abstract method implementation

Replace the token id with the mask token id 100% of the time.

Parameters:

numpy_random_state (RandomState) – The NumPy random state

Returns:

The mask token id

Return type:

int
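To contrast the documented always-mask behaviour with the more familiar BERT-style 80/10/10 scheme, a hypothetical side-by-side sketch; only the T5 branch reflects the docstring above, and the BERT-style counterpart is an assumption added for illustration.

```python
import numpy as np

def t5_get_token_mask(numpy_random_state, mask_id):
    # T5 behaviour per the docstring above: the random state is accepted for
    # interface parity but unused; the mask token id is returned every time
    return mask_id

# Hypothetical BERT-style counterpart for contrast (not part of this class)
def bert_style_get_token_mask(numpy_random_state, mask_id, original_id, vocab_size):
    r = numpy_random_state.random_sample()
    if r < 0.8:
        return mask_id        # 80%: replace with the mask token
    if r < 0.9:
        return original_id    # 10%: keep the original token
    return int(numpy_random_state.randint(vocab_size))  # 10%: random token
```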