models.t5 package

class core.models.T5.t5_model.T5LMHead(*args: Any, **kwargs: Any)

Bases: megatron.core.transformer.module.MegatronModule

Masked LM head for T5

  • config (TransformerConfig) – transformer config

  • parallel_output (bool) – whether the output logits are distributed or not.

  • vocab_size (int) – vocabulary size

  • pre_process (bool) – Include embedding layer

  • share_embeddings_and_output_weights (bool) – When True, input embeddings and output logit weights are shared.

forward(hidden_states: torch.Tensor, word_embeddings_weight: torch.Tensor) → torch.Tensor

Forward pass.

  • hidden_states (Tensor) – output hidden states from decoder

  • word_embeddings_weight (Tensor) – word embedding weight


Returns

logits tensor

Return type

Tensor
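As a rough illustration of what an LM head with shared embeddings computes, the sketch below projects each hidden state onto the rows of the word embedding matrix, so that logits[i][v] is the dot product of hidden state i with the embedding of vocabulary entry v. It uses plain Python lists instead of torch.Tensor and ignores batching, any bias term, and tensor parallelism. The helper name tied_lm_head_logits is hypothetical and this is not the Megatron implementation.

```python
from typing import List

Matrix = List[List[float]]


def tied_lm_head_logits(hidden_states: Matrix, word_embeddings_weight: Matrix) -> Matrix:
    """Hypothetical helper: logits = hidden_states @ word_embeddings_weight^T.

    hidden_states:          [seq_len][hidden_size]
    word_embeddings_weight: [vocab_size][hidden_size]
    returns:                [seq_len][vocab_size]
    """
    logits = []
    for h in hidden_states:
        # One dot product per vocabulary entry.
        row = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for w in word_embeddings_weight]
        logits.append(row)
    return logits


# Toy sizes chosen for illustration: seq_len=2, hidden_size=2, vocab_size=3.
hidden = [[1.0, 0.0], [0.5, 0.5]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = tied_lm_head_logits(hidden, embeddings)
```

Because the same matrix serves as the input embedding table and the output projection, no separate output weight needs to be stored in this sketch.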


class core.models.T5.t5_model.T5Model(*args: Any, **kwargs: Any)

Bases: megatron.core.models.common.language_module.language_module.LanguageModule

T5 Language model.

  • config (TransformerConfig) – transformer config

  • transformer_encoder_layer_spec (ModuleSpec) – transformer layer customization specs for encoder

  • transformer_decoder_layer_spec (ModuleSpec) – transformer layer customization specs for decoder

  • vocab_size (int) – vocabulary size

  • max_sequence_length (int) – maximum sequence length. This is used for the positional embedding

  • pre_process (bool) – Include embedding layer (used with pipeline parallelism)

  • post_process (bool) – Include an output layer (used with pipeline parallelism)

  • fp16_lm_cross_entropy (bool, optional) – Defaults to False

  • parallel_output (bool) – Do not gather the outputs, keep them split across tensor parallel ranks

  • share_embeddings_and_output_weights (bool) – When True, input embeddings and output logit weights are shared. Defaults to False.

  • position_embedding_type (string) – Position embedding type. Options [‘learned_absolute’, ‘rope’]. Default is ‘learned_absolute’.

  • rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings. Defaults to 1.0 (100%). Ignored unless position_embedding_type is ‘rope’.

  • seq_len_interpolation_factor (float) – Scale factor for linearly interpolating RoPE to support longer sequences. The value must be a float larger than 1.0. Defaults to None.
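The constraints stated in the parameter list can be checked before a model is built. The sketch below is a minimal, standalone container for the scalar arguments, assuming only the defaults and options documented above. The class name T5ModelArgs is hypothetical and not part of Megatron Core, the example values are illustrative, and config and the two layer specs are left out because they need Megatron objects.

```python
from dataclasses import dataclass
from typing import Optional

# Options listed in the parameter description above.
POSITION_EMBEDDING_TYPES = ("learned_absolute", "rope")


@dataclass
class T5ModelArgs:
    """Hypothetical argument holder; only documented defaults are filled in."""

    vocab_size: int
    max_sequence_length: int
    pre_process: bool
    post_process: bool
    parallel_output: bool
    fp16_lm_cross_entropy: bool = False
    share_embeddings_and_output_weights: bool = False
    position_embedding_type: str = "learned_absolute"
    rotary_percent: float = 1.0
    seq_len_interpolation_factor: Optional[float] = None

    def __post_init__(self) -> None:
        if self.position_embedding_type not in POSITION_EMBEDDING_TYPES:
            raise ValueError(
                f"position_embedding_type must be one of {POSITION_EMBEDDING_TYPES}"
            )
        factor = self.seq_len_interpolation_factor
        if factor is not None and factor <= 1.0:
            raise ValueError("seq_len_interpolation_factor must be larger than 1.0")


# Illustrative values only.
args = T5ModelArgs(
    vocab_size=32000,
    max_sequence_length=512,
    pre_process=True,
    post_process=True,
    parallel_output=True,
)
```

Validating early like this surfaces a misspelled embedding type or an out-of-range interpolation factor before any model weights are allocated.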

forward(encoder_input_ids: torch.Tensor, decoder_input_ids: torch.Tensor, encoder_attn_mask: torch.Tensor, decoder_attn_mask: torch.Tensor, encoder_decoder_attn_mask: torch.Tensor, lm_labels: Optional[torch.Tensor] = None, inference_params: Optional[megatron.core.InferenceParams] = None) → torch.Tensor

Forward pass.

  • encoder_input_ids (Tensor) – input ids for encoder

  • decoder_input_ids (Tensor) – input ids for decoder

  • encoder_attn_mask (Tensor) – self-attention mask for encoder

  • decoder_attn_mask (Tensor) – self-attention mask for decoder

  • encoder_decoder_attn_mask (Tensor) – cross-attention mask between encoder and decoder

  • lm_labels (Tensor) – labels for decoder output

  • inference_params (InferenceParams) – relevant arguments for inferencing


Returns

loss tensor

Return type

Tensor
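The three mask arguments of forward differ in shape and in what they restrict. The sketch below shows one way they could be derived for a single sequence pair from per-token validity flags, assuming 1 means "may attend" and 0 means "blocked", and assuming the decoder mask is causal. Those conventions, the plain-list representation, and all helper names are assumptions for illustration; check the data pipeline in use for the convention the model actually expects.

```python
from typing import List, Tuple

Mask = List[List[int]]


def pairwise_mask(query_valid: List[int], key_valid: List[int]) -> Mask:
    """mask[i][j] is 1 only if query token i and key token j are both real tokens."""
    return [[q * k for k in key_valid] for q in query_valid]


def causal_mask(length: int) -> Mask:
    """Lower-triangular mask: position i may look at positions 0..i."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


def build_t5_masks(encoder_valid: List[int], decoder_valid: List[int]) -> Tuple[Mask, Mask, Mask]:
    """Hypothetical helper returning (encoder, decoder, encoder-decoder) masks."""
    enc = pairwise_mask(encoder_valid, encoder_valid)          # [s_enc][s_enc]
    dec_padding = pairwise_mask(decoder_valid, decoder_valid)  # [s_dec][s_dec]
    causal = causal_mask(len(decoder_valid))
    dec = [[p * c for p, c in zip(p_row, c_row)] for p_row, c_row in zip(dec_padding, causal)]
    enc_dec = pairwise_mask(decoder_valid, encoder_valid)      # [s_dec][s_enc]
    return enc, dec, enc_dec


# Encoder input has one padding token at the end; decoder input has none.
enc_mask, dec_mask, enc_dec_mask = build_t5_masks([1, 1, 0], [1, 1])
```

In this sketch the cross-attention mask has one row per decoder position and one column per encoder position, which is why it cannot be reused from either self-attention mask.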



See megatron.model.transformer.set_input_tensor()

sharded_state_dict(prefix: str = '', sharded_offsets: tuple = ()) → megatron.core.dist_checkpointing.mapping.ShardedStateDict

shared_embedding_or_output_weight() → torch.Tensor

Function to share the input embeddings and output logit weights.

core.models.T5.t5_model.t5_extended_attention_mask(attention_mask_list: List[torch.Tensor]) → List[torch.Tensor]

core.models.T5.t5_model.t5_position_ids(token_ids: torch.Tensor) → torch.Tensor

Calculate position ids from token ids.

  • token_ids (Tensor) – input tokens

Returns

position ids

Return type

Tensor
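A minimal sketch of the idea, assuming the position ids are simply 0 to seq_len - 1 repeated for every row of the batch and independent of the token values. It uses nested Python lists of shape [batch][seq_len] in place of tensors, and the function name position_ids_like is hypothetical.

```python
from typing import List


def position_ids_like(token_ids: List[List[int]]) -> List[List[int]]:
    """Return, for each row of token_ids, the positions 0..len(row)-1.

    Assumes a rectangular batch, i.e. every row has the same length.
    """
    return [list(range(len(row))) for row in token_ids]


# Two padded sequences of length 4; the token values are arbitrary.
tokens = [[101, 7, 8, 0], [101, 9, 0, 0]]
positions = position_ids_like(tokens)
```

The result has the same shape as the input, so it can index a learned positional embedding table of size max_sequence_length alongside the token embedding lookup.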


© Copyright 2022-2024, NVIDIA. Last updated on Mar 16, 2024.