NeMo Megatron API#

Pretraining Model Classes#

class nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel(*args: Any, **kwargs: Any)[source]#

Bases: NLPModel

Megatron base class. All NeMo Megatron models inherit from this class.

  • Initializes the model-parallel world for NeMo.

  • Turns on all of the NVIDIA optimizations.

  • If cfg.tokenizer is available, loads the tokenizer and pads the vocab to the correct size for tensor model parallelism.

  • If using the distributed optimizer, configures it to be compatible with O2-level optimizations and/or model parallelism.

  • Performs gradient clipping: grad_clip_pl_default triggers the PyTorch Lightning default implementation, with_distributed_adam triggers the distributed optimizer’s implementation, megatron_amp_o2 triggers gradient clipping on the main grads, and otherwise gradient clipping is performed on the model grads.

__init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer, no_lm_init=True)[source]#

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance
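As a sketch of the sub-config structure the docstring describes (the keys mirror the list above; the concrete values are illustrative assumptions, not NeMo defaults), a cfg might look like:

```python
# Illustrative sketch of the cfg sub-configs MegatronBaseModel expects.
# Keys follow the docstring above; all values here are made-up examples.
cfg = {
    "train_ds": {"file_path": "train.jsonl", "batch_size": 8},     # training dataset
    "validation_ds": {"file_path": "val.jsonl", "batch_size": 8},  # validation dataset
    "test_ds": {"file_path": "test.jsonl", "batch_size": 8},       # testing dataset
    "optim": {                                                     # optimizer + LR scheduler
        "name": "fused_adam",
        "lr": 2e-4,
        "sched": {"name": "CosineAnnealing", "warmup_steps": 500},
    },
}

# In practice this is an omegaconf.DictConfig loaded from a model YAML file, e.g.:
#   cfg = OmegaConf.load("megatron_gpt_config.yaml").model
```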

configure_optimizers()[source]#
class nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronBaseModel, TextGeneration

Megatron GPT pretraining

build_train_valid_test_datasets()[source]#
generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: LengthParam, sampling_params: Optional[SamplingParam] = None) → OutputType[source]#

Public method to generate text.

Parameters
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of 3 types:

    1. A list of strings. Each element of the list is an input prompt; the model applies the tokenizer to it.

       E.g. [‘sentence’, ‘sentence2’, …]

    2. A tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length): the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder; the generative model skips the tokenization and padding step. context_lengths has shape (batch_size,): the length of the context tokens for each input sequence.

       E.g. (torch.tensor([[23, 5234, 23, 35, …], [223, 323, 23, 23232, 232, …], …]), torch.tensor([20, 30, …]))

    3. A list of Python dicts. Used for prompt/p-tuning inputs, where a set of key-value pairs is converted into input token embeddings for the model.

       E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”}, …], where ‘prompt-tag’ identifies the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary that controls the generation length. It has the following keys:

    max_length: int, the maximum length of the sequence to be generated.
    min_length: int, the minimum length of the sequence to be generated.

    If None, max_length is set to 30 and min_length to None.

  • sampling_params (SamplingParam) –

    a dictionary that contains the parameters for text sampling. It has the following keys:

    use_greedy: bool, whether to use greedy decoding; otherwise sampling is used.
    top_k: int, the number of highest-probability vocabulary tokens to keep for top-k filtering.
    top_p: float, if set to a float < 1, only the most probable tokens whose probabilities add up to top_p or higher are kept for generation.
    repetition_penalty: float, the parameter for repetition penalty; 1.0 means no penalty.
    add_BOS: bool, whether to add the BOS token at the beginning of the prompt.
    all_probs: bool, whether to return the log probs for all tokens in the vocab.
    compute_logprob: bool, a flag used to compute the log prob of all the input text, a special case of running inference; default False.

    Defaults to None; if None, use_greedy is set to True.

Returns

A dictionary with the following keys:

sentences: List[str], output sentences.
tokens: List[List[str]], output sentences broken into tokens.
logprob: List[List[float]], log probs of the generated tokens.
full_logprob: List[List[float]], log probs of all tokens in the vocab.
token_ids: List[List[int]], output sentence token ids.
offsets: List[List[int]], start positions of the tokens in the output text.

Return type

OutputType
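A usage sketch of the LengthParam and SamplingParam dictionaries described above (they are plain dicts with these keys; the model variable and the call in the trailing comment are assumptions, since an actual call needs a restored checkpoint on GPU):

```python
# LengthParam: controls generation length (keys from the docstring above).
length_params = {
    "max_length": 50,  # maximum number of tokens to generate
    "min_length": 0,   # minimum number of tokens to generate
}

# SamplingParam: controls the sampling strategy (keys from the docstring above).
sampling_params = {
    "use_greedy": False,        # sample instead of greedy decoding
    "top_k": 0,                 # 0 disables top-k filtering
    "top_p": 0.9,               # nucleus-sampling probability mass
    "repetition_penalty": 1.2,  # 1.0 would mean no penalty
    "add_BOS": True,            # prepend BOS token to the prompt
    "all_probs": False,         # don't return full-vocab log probs
    "compute_logprob": False,   # not scoring input text
}

# The call itself requires a restored MegatronGPTModel (not runnable here):
#   output = model.generate(["Deep learning is"], length_params, sampling_params)
#   output["sentences"]  # List[str] of completed texts
```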

on_load_checkpoint(checkpoint) → None[source]#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint

on_save_checkpoint(checkpoint) → None[source]#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-save-checkpoint

setup(stage=None)[source]#
PTL hook that is executed after DDP spawns.

We set up datasets here because Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter, batch_idx)[source]#

We pass the dataloader iterator function to the micro-batch scheduler. The input batch to each micro-batch is fetched using the dataloader function in the micro-batch fwd function.

validation_step(dataloader_iter, batch_idx)[source]#

Our dataloaders produce a micro-batch; we fetch a number of micro-batches from the dataloader, determined by the global batch size and model parallel size, to produce a list of micro-batches. The list of micro-batches is then piped through the pipeline using megatron-core fwd/bwd functions.
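The micro-batch chunking described above can be sketched in isolation (a simplified stand-in, not NeMo's scheduler: it assumes data_parallel_size = 1 so that the number of micro-batches is global_batch_size / micro_batch_size):

```python
def split_into_microbatches(global_batch, micro_batch_size):
    """Chunk a global batch (a list of samples) into micro-batches.
    Illustrates the scheme above with data_parallel_size = 1, where
    num_microbatches = len(global_batch) // micro_batch_size."""
    return [
        global_batch[i:i + micro_batch_size]
        for i in range(0, len(global_batch), micro_batch_size)
    ]

# A global batch of 8 samples with micro_batch_size=2 yields 4 micro-batches,
# which would then be piped through the pipeline one at a time.
mbs = split_into_microbatches(list(range(8)), 2)
# -> [[0, 1], [2, 3], [4, 5], [6, 7]]
```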

class nemo.collections.nlp.models.language_modeling.megatron_bert_model.MegatronBertModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronBaseModel

Megatron BERT pretraining. The model returns outputs of shape [batch, seq, hidden].

build_LDDL_data(cfg)[source]#
build_train_valid_test_datasets()[source]#
on_load_checkpoint(checkpoint) → None[source]#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint

on_save_checkpoint(checkpoint) → None[source]#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-save-checkpoint

setup(stage=None)[source]#
PTL hook that is executed after DDP spawns.

We set up datasets here because Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter, batch_idx)[source]#
validation_step(dataloader_iter, batch_idx)[source]#
class nemo.collections.nlp.models.language_modeling.megatron_bart_model.MegatronBARTModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronT5Model

Megatron BART pretraining

build_train_valid_test_datasets()#
setup(stage=None)#
PTL hook that is executed after DDP spawns.

We set up datasets here because Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter, batch_idx)#

Our dataloaders produce a micro-batch; we fetch a number of micro-batches from the dataloader, determined by the global batch size and model parallel size, to produce a list of micro-batches. The batch should be a list of micro-batches, and those micro-batches should be on CPU; micro-batches are then moved to GPU during the pipeline. The list of micro-batches is then piped through the pipeline using megatron-core fwd/bwd functions.

validation_step(dataloader_iter, batch_idx, dataloader_idx=0)#

return_values - if given, returns a dictionary with given keys and corresponding values

class nemo.collections.nlp.models.language_modeling.megatron_retrieval_model.MegatronRetrievalModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronBaseModel, TextGeneration

Megatron Retrieval enhanced language model

build_train_valid_test_datasets()[source]#
generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: LengthParam, sampling_params: Optional[SamplingParam] = None, **args) → OutputType[source]#

Public method to generate text.

Parameters
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of 3 types:

    1. A list of strings. Each element of the list is an input prompt; the model applies the tokenizer to it.

       E.g. [‘sentence’, ‘sentence2’, …]

    2. A tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length): the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder; the generative model skips the tokenization and padding step. context_lengths has shape (batch_size,): the length of the context tokens for each input sequence.

       E.g. (torch.tensor([[23, 5234, 23, 35, …], [223, 323, 23, 23232, 232, …], …]), torch.tensor([20, 30, …]))

    3. A list of Python dicts. Used for prompt/p-tuning inputs, where a set of key-value pairs is converted into input token embeddings for the model.

       E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”}, …], where ‘prompt-tag’ identifies the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary that controls the generation length. It has the following keys:

    max_length: int, the maximum length of the sequence to be generated.
    min_length: int, the minimum length of the sequence to be generated.

    If None, max_length is set to 30 and min_length to None.

  • sampling_params (SamplingParam) –

    a dictionary that contains the parameters for text sampling. It has the following keys:

    use_greedy: bool, whether to use greedy decoding; otherwise sampling is used.
    top_k: int, the number of highest-probability vocabulary tokens to keep for top-k filtering.
    top_p: float, if set to a float < 1, only the most probable tokens whose probabilities add up to top_p or higher are kept for generation.
    repetition_penalty: float, the parameter for repetition penalty; 1.0 means no penalty.
    add_BOS: bool, whether to add the BOS token at the beginning of the prompt.
    all_probs: bool, whether to return the log probs for all tokens in the vocab.
    compute_logprob: bool, a flag used to compute the log prob of all the input text, a special case of running inference; default False.

    Defaults to None; if None, use_greedy is set to True.

Returns

A dictionary with the following keys:

sentences: List[str], output sentences.
tokens: List[List[str]], output sentences broken into tokens.
logprob: List[List[float]], log probs of the generated tokens.
full_logprob: List[List[float]], log probs of all tokens in the vocab.
token_ids: List[List[int]], output sentence token ids.
offsets: List[List[int]], start positions of the tokens in the output text.

Return type

OutputType

setup(stage=None)[source]#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters

stage – fit, validate, test or predict

training_step(batch, batch_idx)[source]#
validation_step(batch, batch_idx)[source]#
class nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model(*args: Any, **kwargs: Any)[source]#

Bases: MegatronLMEncoderDecoderModel

Megatron T5 pretraining

classmethod add_special_tokens_to_tokenizer(tokenizer, tokenizer_cfg, dataset_type='t5', add_sentinel_tokens_in_reverse_order=False, add_sentinel_tokens_first=False)[source]#
build_train_valid_test_datasets()[source]#
complete(request: Dict)#

Autoregressively invokes the language model in inference mode.

Parameters

request – a dictionary with the following fields:
  • prompt: a string of text the model should complete.
  • tokens_to_generate: how many tokens to generate while doing prompt completion.

Returns

A python dictionary with the following fields
  • prompt: original text of the prompt

  • tokenized_prompt: list of (str) tokens from prompt

  • completion: a python dictionary with the following subfields:
    • tokens: a list of triples (token, token_id, log_prob) comprising the completion

    • text: completion text (as a single string)

Return type

response
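An illustrative request/response sketch for complete(); the field names follow the docstring above, while the prompt text, token strings, and numeric values are made-up examples:

```python
# A request dict with the two fields complete() expects.
request = {
    "prompt": "Translate English to German: hello",  # text to complete
    "tokens_to_generate": 16,                        # tokens to generate
}

# Per the docstring, the response would look roughly like (values invented):
# {
#     "prompt": "Translate English to German: hello",
#     "tokenized_prompt": ["Translate", "English", "to", "German", ":", "hello"],
#     "completion": {
#         "tokens": [("hallo", 1502, -0.7)],  # (token, token_id, log_prob)
#         "text": "hallo",
#     },
# }
```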

decode(tokens_enc, enc_mask, num_tokens_to_generate, encoder_input=None, tokenizer=None, enc_output=None, enc_output_attn_mask=None, ignore_ids=[], bos_id=None, predicted_tokens_dec=None, sampling_method: str = 'greedy-search', sampling_kwargs: dict = {})#

tokens_enc - a tensor of shape [batch_size, seq_len] that contains the input tokens.
enc_mask - a tensor of shape [batch_size, seq_len] that contains the input token mask (1 for active, 0 for inactive).
num_tokens_to_generate - the maximum number of tokens to generate.
encoder_input - a tensor of shape [batch_size, seq_len, hidden_size] that contains the encoder hidden states (replaces tokens_enc if given).
tokenizer - a tokenizer object.
enc_output - a tensor of shape [batch_size, seq_len, hidden_size] that contains the encoder hidden states (replaces tokens_enc and encoder_input if given).
enc_output_attn_mask - a tensor of shape [batch_size, seq_len] that contains the encoder attention mask (replaces enc_mask if given).
ignore_ids - a list of token ids to ignore when sampling.
bos_id - the id of the beginning-of-sentence token. If None, tokenizer.bos_id is used unless explicitly set to something else.
predicted_tokens_dec - a tensor of shape [batch_size, seq_len] that contains the tokens that have already been decoded.
sampling_method - the sampling method to use in the decoding iterations. Currently supported methods are “beam-search”/“greedy-search”/“topkp-sampling”. The argument specifies the sampling function, which takes in a tensor of logits [batch_size, vocab_size] and returns a tuple (tensor of log_probs [batch_size], tensor of token ids sampled from the logits [batch_size]). If beam search is enabled, the sampling function returns tensors of shape [batch_size, beam_size].
sampling_kwargs - a dict with arguments to be passed to the sampling function. Please refer to the method get_sampling_token_fn to see which arguments are required for a chosen sampling_method.

Returns

A tuple of tensors [batch_size, seq_len + 1], [batch_size, seq_len] for the predicted tokens and their log probs. If sampling_method == ‘beam-search’ and keep_only_best_tokens is False, the shapes of the tensors are [batch_size, beam_size, seq_len + 1], [batch_size, beam_size, seq_len].
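The sampling-function contract described above can be sketched with a minimal greedy-search implementation (a pure-Python illustration of the (log_probs, token_ids) signature, not NeMo's actual get_sampling_token_fn; it uses lists in place of tensors):

```python
import math

def greedy_sample(logits):
    """logits: a list of [vocab_size] rows, one per batch element.
    Returns (log_probs, token_ids), mirroring the decode() sampling contract:
    one sampled token id and its log prob per batch element."""
    log_probs, token_ids = [], []
    for row in logits:
        # log-softmax denominator (max-shifted for numerical stability)
        m = max(row)
        denom = math.log(sum(math.exp(x - m) for x in row)) + m
        # greedy search: pick the argmax token
        best = max(range(len(row)), key=lambda i: row[i])
        log_probs.append(row[best] - denom)
        token_ids.append(best)
    return log_probs, token_ids

# batch_size=2, vocab_size=3
lp, ids = greedy_sample([[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]])
# ids -> [1, 0]: the argmax token for each batch element
```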

encode(tokens_enc, enc_mask, encoder_input=None, reconfigure_microbatch=True)#

tokens_enc - encoder input tokens.
enc_mask - the corresponding mask.
encoder_input - encoder input (bypasses tokens); if given, tokens_enc can be None.

setup(stage=None)#
PTL hook that is executed after DDP spawns.

We set up datasets here because Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter, batch_idx)#

Our dataloaders produce a micro-batch; we fetch a number of micro-batches from the dataloader, determined by the global batch size and model parallel size, to produce a list of micro-batches. The batch should be a list of micro-batches, and those micro-batches should be on CPU; micro-batches are then moved to GPU during the pipeline. The list of micro-batches is then piped through the pipeline using megatron-core fwd/bwd functions.

validation_step(dataloader_iter, batch_idx, dataloader_idx=0)#

return_values - if given, returns a dictionary with given keys and corresponding values

Customization Model Classes#

class nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model.MegatronGPTSFTModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronGPTModel

Megatron GPT Supervised Fine-Tuning

build_train_valid_test_datasets(stage)[source]#
generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: LengthParam, sampling_params: Optional[SamplingParam] = None) → OutputType#

Public method to generate text.

Parameters
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of 3 types:

    1. A list of strings. Each element of the list is an input prompt; the model applies the tokenizer to it.

       E.g. [‘sentence’, ‘sentence2’, …]

    2. A tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length): the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder; the generative model skips the tokenization and padding step. context_lengths has shape (batch_size,): the length of the context tokens for each input sequence.

       E.g. (torch.tensor([[23, 5234, 23, 35, …], [223, 323, 23, 23232, 232, …], …]), torch.tensor([20, 30, …]))

    3. A list of Python dicts. Used for prompt/p-tuning inputs, where a set of key-value pairs is converted into input token embeddings for the model.

       E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”}, …], where ‘prompt-tag’ identifies the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary that controls the generation length. It has the following keys:

    max_length: int, the maximum length of the sequence to be generated.
    min_length: int, the minimum length of the sequence to be generated.

    If None, max_length is set to 30 and min_length to None.

  • sampling_params (SamplingParam) –

    a dictionary that contains the parameters for text sampling. It has the following keys:

    use_greedy: bool, whether to use greedy decoding; otherwise sampling is used.
    top_k: int, the number of highest-probability vocabulary tokens to keep for top-k filtering.
    top_p: float, if set to a float < 1, only the most probable tokens whose probabilities add up to top_p or higher are kept for generation.
    repetition_penalty: float, the parameter for repetition penalty; 1.0 means no penalty.
    add_BOS: bool, whether to add the BOS token at the beginning of the prompt.
    all_probs: bool, whether to return the log probs for all tokens in the vocab.
    compute_logprob: bool, a flag used to compute the log prob of all the input text, a special case of running inference; default False.

    Defaults to None; if None, use_greedy is set to True.

Returns

A dictionary with the following keys:

sentences: List[str], output sentences.
tokens: List[List[str]], output sentences broken into tokens.
logprob: List[List[float]], log probs of the generated tokens.
full_logprob: List[List[float]], log probs of all tokens in the vocab.
token_ids: List[List[int]], output sentence token ids.
offsets: List[List[int]], start positions of the tokens in the output text.

Return type

OutputType

setup(stage=None)[source]#
PTL hook that is executed after DDP spawns.

We set up datasets here because Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter, batch_idx)#

We pass the dataloader iterator function to the micro-batch scheduler. The input batch to each micro-batch is fetched using the dataloader function in the micro-batch fwd function.

validation_step(dataloader_iter, batch_idx, dataloader_idx=0)[source]#

Our dataloaders produce a micro-batch; we fetch a number of micro-batches from the dataloader, determined by the global batch size and model parallel size, to produce a list of micro-batches. The list of micro-batches is then piped through the pipeline using megatron-core fwd/bwd functions.

class nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTAdapterLearningModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronGPTBaseAdapterModel

MegatronGPTAdapterLearningModel is a model that combines a base model (GPTModel) with adapters. This class only supports the canonical adapter training described in Houlsby et al. (https://arxiv.org/pdf/1902.00751.pdf)

Two adapters are inserted into each Transformer layer in the base GPT model.

It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights are saved and can be re-loaded and infused into the same GPT model for inference.

__init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)[source]#

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: LengthParam, sampling_params: Optional[SamplingParam] = None, batch_size: Optional[int] = 1)#

Public method to generate text.

Parameters
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of 3 types:

    1. A list of strings. Each element of the list is an input prompt; the model applies the tokenizer to it.

       E.g. [‘sentence’, ‘sentence2’, …]

    2. A tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length): the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder; the generative model skips the tokenization and padding step. context_lengths has shape (batch_size,): the length of the context tokens for each input sequence.

       E.g. (torch.tensor([[23, 5234, 23, 35, …], [223, 323, 23, 23232, 232, …], …]), torch.tensor([20, 30, …]))

    3. A list of Python dicts. Used for prompt/p-tuning inputs, where a set of key-value pairs is converted into input token embeddings for the model.

       E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”}, …], where ‘prompt-tag’ identifies the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary that controls the generation length. It has the following keys:

    max_length: int, the maximum length of the sequence to be generated.
    min_length: int, the minimum length of the sequence to be generated.

    If None, max_length is set to 30 and min_length to None.

  • sampling_params (SamplingParam) –

    a dictionary that contains the parameters for text sampling. It has the following keys:

    use_greedy: bool, whether to use greedy decoding; otherwise sampling is used.
    top_k: int, the number of highest-probability vocabulary tokens to keep for top-k filtering.
    top_p: float, if set to a float < 1, only the most probable tokens whose probabilities add up to top_p or higher are kept for generation.
    repetition_penalty: float, the parameter for repetition penalty; 1.0 means no penalty.
    add_BOS: bool, whether to add the BOS token at the beginning of the prompt.
    all_probs: bool, whether to return the log probs for all tokens in the vocab.
    compute_logprob: bool, a flag used to compute the log prob of all the input text, a special case of running inference; default False.

    Defaults to None; if None, use_greedy is set to True.

Returns

A dictionary with the following keys:

sentences: List[str], output sentences.
tokens: List[List[str]], output sentences broken into tokens.
logprob: List[List[float]], log probs of the generated tokens.
full_logprob: List[List[float]], log probs of all tokens in the vocab.
token_ids: List[List[int]], output sentence token ids.
offsets: List[List[int]], start positions of the tokens in the output text.

Return type

OutputType

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters

stage – fit, validate, test or predict

state_dict(destination=None, prefix=None, keep_vars=False)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.
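The idea behind this override can be illustrated with a key-filtering sketch (the parameter names below are made up for illustration; NeMo's actual implementation walks its adapter modules rather than matching key substrings):

```python
# Hypothetical full state dict of a GPT model with injected adapters.
# Keys are invented examples; values stand in for tensors.
full_state = {
    "model.layers.0.attention.weight": "frozen base weight",
    "model.layers.0.adapter_1.linear_in.weight": "trainable",
    "model.layers.0.adapter_1.linear_out.weight": "trainable",
    "model.layers.1.attention.weight": "frozen base weight",
}

# Keep only adapter parameters, so the checkpoint excludes the frozen base
# model weights, which is what the override above accomplishes.
adapter_state = {k: v for k, v in full_state.items() if "adapter" in k}
```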

training_step(dataloader_iter, batch_idx)#
validation_step(dataloader_iter, batch_idx)#
class nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTInfusedAdapterModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronGPTBaseAdapterModel

MegatronGPTInfusedAdapterModel is a model that combines a base model (GPTModel) with an “Infused Adapter by Inhibiting and Amplifying Inner Activations”, known as IA3. This class supports the addition of IA3 into a transformer-based LM as described in Liu et al. (https://arxiv.org/pdf/2205.05638.pdf)

Three adapters are inserted into each Transformer layer in the base GPT model. Each adapter is essentially a vector that scales the key, value, or FFN hidden representations.

It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights are saved and can be re-loaded and infused into the same GPT model for inference.

__init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)[source]#

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: LengthParam, sampling_params: Optional[SamplingParam] = None, batch_size: Optional[int] = 1)#

Public method to generate text.

Parameters
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of 3 types:

    1. A list of strings. Each element of the list is an input prompt; the model applies the tokenizer to it.

       E.g. [‘sentence’, ‘sentence2’, …]

    2. A tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length): the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder; the generative model skips the tokenization and padding step. context_lengths has shape (batch_size,): the length of the context tokens for each input sequence.

       E.g. (torch.tensor([[23, 5234, 23, 35, …], [223, 323, 23, 23232, 232, …], …]), torch.tensor([20, 30, …]))

    3. A list of Python dicts. Used for prompt/p-tuning inputs, where a set of key-value pairs is converted into input token embeddings for the model.

       E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”}, …], where ‘prompt-tag’ identifies the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary that controls the generation length. It has the following keys:

    max_length: int, the maximum length of the sequence to be generated.
    min_length: int, the minimum length of the sequence to be generated.

    If None, max_length is set to 30 and min_length to None.

  • sampling_params (SamplingParam) –

    a dictionary that contains the parameters for text sampling. It has the following keys:

    use_greedy: bool, whether to use greedy decoding; otherwise sampling is used.
    top_k: int, the number of highest-probability vocabulary tokens to keep for top-k filtering.
    top_p: float, if set to a float < 1, only the most probable tokens whose probabilities add up to top_p or higher are kept for generation.
    repetition_penalty: float, the parameter for repetition penalty; 1.0 means no penalty.
    add_BOS: bool, whether to add the BOS token at the beginning of the prompt.
    all_probs: bool, whether to return the log probs for all tokens in the vocab.
    compute_logprob: bool, a flag used to compute the log prob of all the input text, a special case of running inference; default False.

    Defaults to None; if None, use_greedy is set to True.

Returns

A dictionary with the following keys:

sentences: List[str], output sentences.
tokens: List[List[str]], output sentences broken into tokens.
logprob: List[List[float]], log probs of the generated tokens.
full_logprob: List[List[float]], log probs of all tokens in the vocab.
token_ids: List[List[int]], output sentence token ids.
offsets: List[List[int]], start positions of the tokens in the output text.

Return type

OutputType

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters

stage – fit, validate, test or predict

state_dict(destination=None, prefix=None, keep_vars=False)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.

training_step(dataloader_iter, batch_idx)#
validation_step(dataloader_iter, batch_idx)#
class nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model.MegatronGPTPromptLearningModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronBasePromptLearningModel

Model class for prompt-tuning or p-tuning a pretrained Megatron GPT model.

Prompt tuning initializes virtual prompt embeddings directly from a copy of certain token embeddings in the pretrained GPT model’s vocabulary and directly tunes these embedding weights. The token embeddings used in initialization are specified by the user in the config file. The model can be prompt-tuned for multiple tasks at once. Virtual prompts are stored in a prompt table and can be added or deleted without disrupting virtual prompts for other tasks.

P-tuning initializes an LSTM encoder model that generates virtual prompt embeddings for every task. Each task shares the same encoder. After p-tuning is complete, the learned virtual prompts can be saved to the prompt table using add_ptuned_prompts_to_prompt_table(). Thus, if a user wants to add a new virtual prompt via p-tuning, they do not need to retrain on all previous tasks. This gives p-tuning the same task flexibility as prompt tuning.

generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: LengthParam, sampling_params: Optional[SamplingParam] = None, batch_size: Optional[int] = 1)[source]#

Public method to generate text.

Parameters
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of three types:

    1. A list of strings. Each element of the list is an input prompt; the model applies the tokenizer to it.

       E.g. [‘sentence’, ‘sentence2’, …]

    2. A tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length) and holds the batched token sequences used as prompts for generation or as model inputs to the encoder; the generative model skips the tokenization and padding steps. context_lengths has shape (batch_size,) and gives the length of the context tokens for each input sequence.

       E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. A list of Python dict objects. Used for prompt/p-tuning inputs, where each set of key-value pairs is converted into input token embeddings for the model.

       E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”}, …] where ‘prompt-tag’ identifies the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary that controls the sampling length. It has the following keys:

    max_length: int, the maximum length of the sequence to be generated.
    min_length: int, the minimum length of the sequence to be generated.

    If None, max_length is set to 30 and min_length is set to None.

  • sampling_params (SamplingParam) –

    a dictionary that contains the parameters for text sampling. It has the following keys:

    use_greedy: bool, whether to use greedy decoding instead of sampling.
    top_k: int, the number of highest-probability vocabulary tokens to keep for top-k filtering.
    top_p: float, if set to a float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
    repetition_penalty: float, the parameter for repetition penalty; 1.0 means no penalty.
    add_BOS: bool, whether to add the BOS token at the beginning of the prompt.
    all_probs: bool, whether to return the log prob for all the tokens in the vocab.
    compute_logprob: bool, a flag used to compute the log prob of all the input text; a special case of running inference, default False.

    Default None. If None, use_greedy is set to True.

Returns

The output is a dictionary with the following keys:

sentences: List[str], output sentences.
tokens: List[List[str]], output sentences broken into tokens.
logprob: List[List[float]], log prob of generated tokens.
full_logprob: List[List[float]], log prob of all the tokens in the vocab.
token_ids: List[List[int]], output sentence token ids.
offsets: List[List[int]], list of token start positions in the text.

Return type

OutputType
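The length_params and sampling_params arguments documented above are plain dictionaries. The following sketch builds both with the keys listed in this reference; the specific values are illustrative, and the commented call assumes a loaded model instance.

```python
# length_params controls generation length (keys per the documentation above).
length_params = {
    "max_length": 50,  # maximum number of tokens to generate
    "min_length": 0,   # minimum number of tokens to generate
}

# sampling_params controls decoding (keys per the documentation above).
sampling_params = {
    "use_greedy": False,        # sample instead of greedy decoding
    "top_k": 0,                 # 0 disables top-k filtering
    "top_p": 0.9,               # nucleus sampling threshold
    "repetition_penalty": 1.2,  # > 1.0 discourages repetition
    "add_BOS": True,            # prepend the BOS token to the prompt
    "all_probs": False,         # do not return log probs over the full vocab
    "compute_logprob": False,   # skip input-text log prob computation
}

# Typical call, assuming `model` is a loaded NeMo Megatron model:
# output = model.generate(["sentence", "sentence2"], length_params, sampling_params)
# print(output["sentences"])
```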

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters

stage – fit, validate, test or predict

training_step(dataloader_iter, batch_idx)[source]#
validation_step(dataloader_iter, batch_idx)[source]#
class nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5AdapterLearningModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronT5BaseAdapterModel

TODO (@adithyare)

__init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)[source]#

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

_add_adapters_to_component(component, component_cfg, adapter_name_keys)[source]#
setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters

stage – fit, validate, test or predict

state_dict(destination=None, prefix=None, keep_vars=False)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base model.

training_step(dataloader_iter, batch_idx)#
validation_step(dataloader_iter, batch_idx, inference=False)#
class nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5InfusedAdapterModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronT5BaseAdapterModel

MegatronT5InfusedAdapterModel combines a base model with an “Infused Adapter that Inhibits and Amplifies Inner Activations”, known as IA3. This class supports the addition of IA3 into a transformer-based LM as described in Liu et al. (https://arxiv.org/pdf/2205.05638.pdf).

Three adapters are inserted into each Transformer layer of the base model. Each adapter is essentially a vector that scales the key, value, or FFN hidden representations.

It is assumed that this set of adapters will be trained for a specific task. Once trained, the adapter weights are saved and can be re-loaded and infused into the same base model for inference.
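The elementwise rescaling that IA3 performs can be shown in a few lines. This is a toy sketch of the idea only; the `ia3_scale` helper and list-based vectors are illustrative, not NeMo's actual module API (which operates on tensors inside the attention and FFN blocks).

```python
# Sketch of the IA3 idea: a learned vector rescales a hidden
# representation elementwise.
def ia3_scale(hidden, scale_vector):
    """Elementwise rescale of a hidden vector by a learned IA3 vector."""
    assert len(hidden) == len(scale_vector)
    return [h * s for h, s in zip(hidden, scale_vector)]

keys = [1.0, 2.0, 3.0]   # toy key projections for one token
l_k = [0.5, 1.0, 2.0]    # learned IA3 scaling vector for the keys
scaled = ia3_scale(keys, l_k)  # -> [0.5, 2.0, 6.0]
```

Because only these small vectors are trained, the checkpoint holds just the adapter weights, which is exactly what the state_dict override above ensures.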

__init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)[source]#

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

_add_adapters_to_component(component, component_cfg, adapter_name_keys)[source]#
setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters

stage – fit, validate, test or predict

state_dict(destination=None, prefix=None, keep_vars=False)[source]#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base model.

training_step(dataloader_iter, batch_idx)#
validation_step(dataloader_iter, batch_idx, inference=False)#

Modules#

class nemo.collections.nlp.modules.common.megatron.module.MegatronModule(*args: Any, **kwargs: Any)[source]#

Bases: Module

Megatron specific extensions of torch Module with support for pipelining.

decoder_cross_attention_relative_position_embeddings_weight()[source]#
decoder_relative_position_embeddings_weight()[source]#
encoder_relative_position_embeddings_weight()[source]#
initialize_word_embeddings(init_method, vocab_size, hidden_size, param_dtype=torch.float32)[source]#
position_embeddings_weight()[source]#
state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)[source]#

Use this function to override the state dict for saving checkpoints.

sync_initial_decoder_cross_attention_relative_position_embeddings()[source]#
sync_initial_decoder_relative_position_embeddings()[source]#
sync_initial_encoder_relative_position_embeddings()[source]#
sync_initial_position_embeddings()[source]#
sync_initial_word_embeddings()[source]#
word_embeddings_weight()[source]#
class nemo.collections.nlp.modules.common.megatron.module.Float16Module(*args: Any, **kwargs: Any)[source]#

Bases: MegatronModule

decoder_cross_attention_relative_position_embeddings_weight()[source]#
decoder_relative_position_embeddings_weight()[source]#
encoder_relative_position_embeddings_weight()[source]#
forward(*inputs, **kwargs)[source]#
position_embeddings_weight()[source]#
set_input_tensor(input_tensor)[source]#
state_dict(destination=None, prefix='', keep_vars=False)[source]#
state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)[source]#

Use this function to override the state dict for saving checkpoints.

word_embeddings_weight()[source]#
class nemo.collections.nlp.models.language_modeling.megatron.gpt_model.GPTModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronModule

GPT-2 Language model.

forward(input_ids, position_ids, attention_mask, labels=None, token_type_ids=None, layer_past=None, get_key_value=False, forward_method_parallel_output=None, encoder_input=None, set_inference_key_value_memory=False, inference_max_sequence_len=None, checkpoint_activations_all_layers=None)[source]#
class nemo.collections.nlp.models.language_modeling.megatron.bert_model.BertModel(*args: Any, **kwargs: Any)[source]#

Bases: MegatronModule

BERT language model. The model returns output of shape [seq, batch, hidden].

forward(bert_model_input, attention_mask, token_type_ids=None, lm_labels=None, checkpoint_activations_all_layers=None)[source]#
class nemo.collections.nlp.modules.common.megatron.token_level_encoder_decoder.MegatronTokenLevelEncoderDecoderModule(*args: Any, **kwargs: Any)[source]#

Bases: MegatronModule

Token-based (input/output is tokens) encoder-decoder model (e.g. T5 Language model.)

forward(enc_input_ids=None, enc_attn_mask=None, dec_input_ids=None, dec_attn_mask=None, token_type_ids=None, labels=None, enc_output=None, enc_output_attn_mask=None, enc_input=None, output_enc_hidden_only=False)[source]#

Return value is per token / per dimension (i.e., non collapsed loss value)

class nemo.collections.nlp.modules.common.megatron.retrieval_token_level_encoder_decoder.MegatronRetrievalTokenLevelEncoderDecoderModule(*args: Any, **kwargs: Any)[source]#

Bases: MegatronModule

Token-based (input/output is tokens) retrieval encoder-decoder model

forward(input_ids, input_attn_mask, retrieved_ids, retrieved_attn_mask, token_type_ids=None, labels=None, input_emb=None, set_inference_key_value_memory=False, inference_max_sequence_len=None, neighbors=None, position_ids=None)[source]#

Return value is per token / per dimension (i.e., non collapsed loss value)

Datasets#

class nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset.BlendableDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

create_data_mmap()[source]#
class nemo.collections.nlp.data.language_modeling.megatron.gpt_dataset.GPTDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

create_data_mmap()[source]#
class nemo.collections.nlp.data.language_modeling.megatron.gpt_dataset.MockGPTDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.bert_dataset.BertDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.base_prompt_learning_dataset.BasePromptLearningDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

The base dataset class for prompt-tuning or p-tuning. TODO: (@adithyare) should be merged into GPTPromptLearningDataset

pad_taskname_ids(taskname_ids)[source]#
class nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset.GPTSFTDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

collate_fn(batch)[source]#

This is the method that the user passes as the collate function to DataLoader. The method optionally performs neural type checking and adds types to the outputs.

Please note, subclasses of Dataset should not implement input_types.

# Usage:
dataloader = torch.utils.data.DataLoader(
    …., collate_fn=dataset.collate_fn, ….
)

Returns

Collated batch, with or without types.
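The collate pattern described above can be sketched with a plain padding function. This is a hedged illustration: `pad_collate` is a hypothetical stand-in, and the real GPTSFTDataset.collate_fn also builds attention and loss masks and attaches neural types.

```python
# Toy collate function: pad variable-length token id lists in a batch
# to the batch maximum, which is what a collate_fn for an LM dataset
# fundamentally has to do before tensors can be stacked.
def pad_collate(batch, pad_id=0):
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

batch = [[5, 7, 9], [3, 4], [8]]
padded = pad_collate(batch)
# padded == [[5, 7, 9], [3, 4, 0], [8, 0, 0]]
```

In practice this role is filled by dataset.collate_fn, passed to torch.utils.data.DataLoader as shown in the docstring.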

class nemo.collections.nlp.data.language_modeling.megatron.retro_dataset.RETRODataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

Dataset for RETRO model.

It constructs a single data record from the training/retrieval indexed retrieval dataset and the KNN index file. The KNN index file maps each data chunk id to its K nearest neighbors among the retrieval dataset chunk ids. First, it loads a long sequence (2048 tokens) from the training dataset. Then, for each chunk in the sequence, it finds the KNN chunks from the retrieval dataset using the KNN index. Lastly, it computes the masks based on the pad id.
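The record construction described above can be sketched as chunking plus an index lookup. All names here are hypothetical; the real dataset works with memory-mapped token arrays and a precomputed KNN index file rather than Python dicts.

```python
# Sketch: split a training sequence into fixed-size chunks and map each
# chunk to its K nearest retrieval-dataset chunk ids via a KNN index.
def build_record(sequence, chunk_size, knn_index, k):
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    neighbors = [knn_index[chunk_id][:k] for chunk_id in range(len(chunks))]
    return chunks, neighbors

seq = list(range(8))                       # toy "token" sequence
knn = {0: [10, 11, 12], 1: [20, 21, 22]}   # chunk id -> retrieval chunk ids
chunks, nbrs = build_record(seq, 4, knn, 2)
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7]]
# nbrs   == [[10, 11], [20, 21]]
```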

class nemo.collections.nlp.data.language_modeling.megatron.t5_dataset.T5Dataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

classmethod build_training_sample(sample, target_seq_length, np_rng, max_seq_length, max_seq_length_dec, masked_lm_prob, vocab_id_list, vocab_id_to_token_dict, cls_id, sep_id, mask_id, max_ngram_size, whole_word_masking, favor_long_ngrams, permutation, mean_ngram_size, geometric_dist, tokenizer_type, sentinel_tokens, bos_id, eos_id, pad_id, skip_masking_id=None)[source]#

Build a training sample. :param sample: A list of sentences in which each sentence is a list of token ids. :param target_seq_length: Desired sequence length. :param max_seq_length: Maximum length of the sequence. All values are padded to this length.

Parameters
  • vocab_id_list – List of vocabulary ids. Used to pick a random id.

  • vocab_id_to_token_dict – A dictionary from vocab ids to text tokens.

  • cls_id – Start of example id.

  • sep_id – Separator id.

  • mask_id – Mask token id.

  • pad_id – Padding token id.

  • masked_lm_prob – Probability to mask tokens.

  • np_rng – Random number generator. Note that this rng state should be numpy and not python, since python randint is inclusive for the upper bound whereas the numpy one is exclusive.

  • bos_id – start of decoder example id

  • eos_id – end of generation id

  • sentinel_tokens – unique value to be substituted for every replaced span

  • tokenizer_type – wordpiece (BERT-style) or sentencepiece tokenizer. Used for whole word masking logic.

  • max_ngram_size – maximum size of ngrams to be masked.

  • mean_ngram_size – mean size of ngrams to be masked (only used if geometric_dist=True).

  • geometric_dist – Uses a geometric distribution to sample ngram size.

  • permutation – Permutes the ngrams.

  • whole_word_masking – Always masks entire words instead of individual sub-word tokens.

  • favor_long_ngrams – Favor longer ngrams over shorter ones.

  • skip_masking_id – An id that will not be masked. TODO: Add support for a list of IDs.
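The sentinel-token mechanism described by these parameters can be shown with a toy span-corruption helper. This is a simplified sketch under stated assumptions: `mask_spans` is hypothetical, span positions are given explicitly rather than sampled from an ngram distribution, and the real build_training_sample also handles padding and special tokens.

```python
# Toy T5-style span corruption: each masked span in the encoder input is
# replaced by a unique sentinel id, and the decoder target pairs each
# sentinel with the span it replaced.
def mask_spans(tokens, spans, sentinel_tokens):
    enc, dec = [], []
    prev_end = 0
    for sentinel, (start, end) in zip(sentinel_tokens, spans):
        enc += tokens[prev_end:start] + [sentinel]  # keep text, drop the span
        dec += [sentinel] + tokens[start:end]       # target recovers the span
        prev_end = end
    enc += tokens[prev_end:]
    return enc, dec

tokens = [11, 12, 13, 14, 15, 16]
enc, dec = mask_spans(tokens, [(1, 3), (4, 5)], sentinel_tokens=[900, 901])
# enc == [11, 900, 14, 901, 16]
# dec == [900, 12, 13, 901, 15]
```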

classmethod pad_and_convert_to_numpy(output_tokens, masked_positions, masked_labels, sentinel_tokens, bos_id, eos_id, pad_id, max_seq_length, max_seq_length_dec, masked_spans=None)[source]#

Pad sequences and convert them to numpy.

class nemo.collections.nlp.data.language_modeling.megatron.t5_prompt_learning_dataset.T5PromptLearningDataset(*args: Any, **kwargs: Any)[source]#

Bases: BasePromptLearningDataset

The dataset class for prompt-tuning or p-tuning pretrained T5 models.

collate_fn(batch)[source]#

Prepares enc_input, dec_input, labels, loss_mask, enc_mask, dec_mask, position_ids, taskname_ids for global batch

load_data(dataset)[source]#

Loads a dataset by filling in the task templates specified in the config file with the information from each training/inference example. Converts all input text into token ids. Also replaces the <|VIRTUAL_PROMPT_#|> placeholders in the task templates with the actual virtual prompt token ids.

params:
    dataset: A list of json objects or dictionary objects, each containing the information needed for a training example.
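The template-filling step described above can be sketched with string substitution. The `fill_template` helper and the `{sentence}` template syntax are hypothetical illustrations; the real load_data converts the filled text to token ids and handles multiple virtual prompt placeholders.

```python
# Toy sketch of task-template filling: substitute example fields into the
# template, then swap the <|VIRTUAL_PROMPT_0|> placeholder for virtual
# prompt token ids.
def fill_template(template, example, virtual_token_ids):
    text = template.format(**example)
    return text.replace("<|VIRTUAL_PROMPT_0|>",
                        " ".join(map(str, virtual_token_ids)))

template = "<|VIRTUAL_PROMPT_0|> sentiment: {sentence}"
example = {"sentence": "this is a good movie"}
filled = fill_template(template, example, [50001, 50002])
# filled == "50001 50002 sentiment: this is a good movie"
```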

pad_batch_and_build_loss_mask(enc_input, dec_input, dec_labels)[source]#

Pad enc_input, dec_input, labels in batch to max batch length while building loss_mask, enc_mask, and dec_mask
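The loss-mask construction described above can be sketched as follows. `pad_and_mask` is a hypothetical stand-in for the real method, which additionally builds enc_mask and dec_mask.

```python
# Pad sequences to the batch max length and build a loss mask that is
# 1.0 on real tokens and 0.0 on padding, so pad positions contribute
# nothing to the loss.
def pad_and_mask(batch, pad_id=0):
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    loss_mask = [[1.0] * len(seq) + [0.0] * (max_len - len(seq))
                 for seq in batch]
    return padded, loss_mask

padded_out, mask_out = pad_and_mask([[4, 5, 6], [7]])
# padded_out == [[4, 5, 6], [7, 0, 0]]
# mask_out   == [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
```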

class nemo.collections.nlp.data.language_modeling.megatron.ul2_dataset.UL2Dataset(*args: Any, **kwargs: Any)[source]#

Bases: T5Dataset

UL2 Dataset from https://arxiv.org/abs/2205.05131. Consists of three different objectives: 1. Short span masking with small probabilities (e.g., T5); typically a max ngram size of 5 with 0.15 mask prob. 2. Extreme span masking with either large probabilities, large ngram sizes, or both. 3. Prefix-LM as in T5 or LM-adapted T5 (the prompt-tuning paper).
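The three objectives listed above are mixed per training sample, as reflected in the get_r_masking_/get_s_masking_/get_x_masking_training_sample class methods below. The uniform routing in this sketch is an assumption for illustration; the actual mixing ratios come from the dataset configuration.

```python
import random

# Route each training sample to one of UL2's three denoising objectives.
# Uniform 1/3 probabilities are assumed here for illustration only.
def pick_objective(rng):
    r = rng.random()
    if r < 1 / 3:
        return "r-masking"   # short spans, low mask prob (T5-style)
    elif r < 2 / 3:
        return "x-masking"   # extreme spans and/or high mask prob
    return "s-masking"       # prefix-LM objective

rng = random.Random(0)
objectives = {pick_objective(rng) for _ in range(100)}
# over 100 draws, all three objectives appear
```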

classmethod build_extreme_masking_training_sample(sample, target_seq_length, np_rng, max_seq_length, max_seq_length_dec, masked_lm_prob, extreme_masked_lm_prob, mask_id, max_ngram_size, min_ngram_size, mean_ngram_size, extreme_max_ngram_size, extreme_mean_ngram_size, extreme_min_ngram_size, extreme_ngram_span_length_distribution, sentinel_tokens, bos_id, eos_id, pad_id, skip_masking_id=None)[source]#

Build a training sample. :param sample: A list of sentences in which each sentence is a list of token ids. :param target_seq_length: Desired sequence length. :param max_seq_length: Maximum length of the sequence. All values are padded to this length.

Parameters
  • vocab_id_list – List of vocabulary ids. Used to pick a random id.

  • vocab_id_to_token_dict – A dictionary from vocab ids to text tokens.

  • cls_id – Start of example id.

  • sep_id – Separator id.

  • mask_id – Mask token id.

  • pad_id – Padding token id.

  • masked_lm_prob – Probability to mask tokens.

  • np_rng – Random number generator. Note that this rng state should be numpy and not python, since python randint is inclusive for the upper bound whereas the numpy one is exclusive.

  • bos_id – start of decoder example id

  • eos_id – end of generation id

  • sentinel_tokens – unique value to be substituted for every replaced span

  • tokenizer_type – wordpiece (BERT-style) or sentencepiece tokenizer. Used for whole word masking logic.

  • max_ngram_size – maximum size of ngrams to be masked.

  • mean_ngram_size – mean size of ngrams to be masked (only used if geometric_dist=True).

  • geometric_dist – Uses a geometric distribution to sample ngram size.

  • permutation – Permutes the ngrams.

  • whole_word_masking – Always masks entire words instead of individual sub-word tokens.

  • favor_long_ngrams – Favor longer ngrams over shorter ones.

  • skip_masking_id – id of the token that will never be masked.

classmethod get_r_masking_training_sample(sample, tokenizer, np_rng, target_seq_length: int, max_seq_length: int, max_seq_length_dec: int, masked_lm_prob: float, vocab_id_list: list, vocab_id_to_token_dict: dict, max_ngram_size: int, mean_ngram_size: int, whole_word_masking: bool, favor_long_ngrams: bool, permutation: bool, geometric_dist: bool, tokenizer_type: str, sentinel_tokens: list, skip_masking_id: int)[source]#
classmethod get_s_masking_training_sample(sample, np_rng, max_seq_length_encoder: int, max_seq_length_decoder: int, tokenizer: TokenizerSpec, prefix_lm_pivot_mean: float, pivot_distribution: LengthDistribution, add_eos: bool = False)[source]#
classmethod get_x_masking_training_sample(sample, tokenizer, np_rng, target_seq_length: int, max_seq_length: int, max_seq_length_dec: int, masked_lm_prob: float, extreme_masked_lm_prob: float, max_ngram_size: int, min_ngram_size: int, mean_ngram_size: int, extreme_max_ngram_size: int, extreme_min_ngram_size: int, extreme_mean_ngram_size: int, extreme_ngram_span_length_distribution: LengthDistribution, sentinel_tokens: list, skip_masking_id: int)[source]#