Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

NeMo Large Language Model API#

Pretraining Model Classes#

class nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel(*args: Any, **kwargs: Any)#

Bases: NLPModel

Megatron base class. All NeMo Megatron models inherit from this class.

  • Initializes the model parallel world for NeMo.

  • Turns on all of the NVIDIA optimizations.

  • If cfg.tokenizer is available, loads the tokenizer and pads the vocab to the correct size for tensor model parallelism.

  • If using the distributed optimizer, configures it to be compatible with O2-level optimizations and/or model parallelism.

  • Performs gradient clipping: grad_clip_pl_default triggers the PyTorch Lightning default implementation, with_distributed_adam triggers the distributed optimizer’s implementation, megatron_amp_O2 triggers gradient clipping on the main grads, and otherwise gradient clipping is performed on the model grads.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
no_lm_init=True,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance
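
The sketch below is not part of the original docstring; it illustrates one way to build the cfg and trainer arguments described above. The sub-config names (train_ds, validation_ds, test_ds, optim) come from the parameter description, while the individual fields are hypothetical placeholders rather than a complete Megatron config schema (real configs need many more keys; see the NeMo example configs).

>>> from omegaconf import OmegaConf
>>> import pytorch_lightning as pl
>>> cfg = OmegaConf.create({
...     "train_ds": {"file_path": "train.jsonl", "micro_batch_size": 4},     # hypothetical fields
...     "validation_ds": {"file_path": "val.jsonl", "micro_batch_size": 4},  # hypothetical fields
...     "test_ds": {"file_path": "test.jsonl", "micro_batch_size": 4},       # hypothetical fields
...     "optim": {"name": "fused_adam", "lr": 1e-4,                          # hypothetical fields
...               "sched": {"name": "CosineAnnealing", "warmup_steps": 100}},
... })
>>> trainer = pl.Trainer(devices=1, accelerator="gpu", max_steps=100)  # real Megatron training also uses NeMo's NLPDDPStrategy
>>> # a concrete subclass such as MegatronGPTModel would then be constructed as:
>>> # model = MegatronGPTModel(cfg=cfg, trainer=trainer)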

class nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel(*args: Any, **kwargs: Any)#

Bases: MegatronBaseModel, TextGeneration

Megatron GPT pretraining

generate(
inputs: List[str] | torch.Tensor | List[dict],
length_params: LengthParam,
sampling_params: SamplingParam | None = None,
*,
strategy: TextGenerationStrategy | None = None,
) OutputType#

Public method to generate text.

Parameters:
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of the 3 types:

    1. List of strings. Each element of the list is an input prompt; the model will apply the tokenizer to it.

      E.g. [‘sentence’, ‘sentence2’ … ]

    2. Tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length); it contains the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder.

      The generative model will skip the tokenization and padding step. context_lengths has shape (batch_size,) and indicates the length of the context tokens for each input sequence. E.g. (torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs is converted into input token embeddings for the model.

      E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary type which controls the sampling length.

    • max_length: int, The maximum length of the sequence to be generated.

    • min_length: int, The minimum length of the sequence to be generated.

    If None, max_length is set to 30, and min_length is set to None

  • sampling_params (SamplingParam) –

    a dictionary type which contains the parameters for text sampling. It has the following keys

    • use_greedy: bool, Whether to use greedy decoding; otherwise sampling is used.

    • top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.

    • top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

    • repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.

    • add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.

    • all_probs: bool, Whether to return the log probabilities for all tokens in the vocabulary.

    • compute_logprob: bool, A flag to compute the log probability of the input text; a special case of running inference. Defaults to False.

    • end_strings: List[str], Generation stops when one of these strings is generated.

    Defaults to None; if None, use_greedy is set to True.

Returns:

It generates the output in a dictionary type. It has the following keys,

  • sentences: List[str], output sentences

  • tokens: List[List[str]], output sentences broken into tokens

  • logprob: List[List[float]], log prob of generated tokens

  • full_logprob: List[List[float]], log prob of all the tokens in the vocab

  • token_ids: List[List[int]], output sentence token ids

  • offsets: List[List[int]], list of token start positions in the text
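
The following usage sketch is not part of the original docstring. It assumes model is an already restored MegatronGPTModel (for example via restore_from); the dictionary keys mirror the LengthParam and SamplingParam fields documented above, the values are illustrative, and temperature is included because it is part of NeMo’s SamplingParam even though it is not listed here.

>>> length_params = {"max_length": 50, "min_length": 0}
>>> sampling_params = {
...     "use_greedy": False,
...     "temperature": 1.0,          # part of SamplingParam, though not listed above
...     "top_k": 0,
...     "top_p": 0.9,
...     "repetition_penalty": 1.2,
...     "add_BOS": True,
...     "all_probs": False,
...     "compute_logprob": False,
...     "end_strings": ["<|endoftext|>"],
... }
>>> output = model.generate(
...     inputs=["Deep learning is", "NeMo is a toolkit for"],
...     length_params=length_params,
...     sampling_params=sampling_params,
... )
>>> print(output["sentences"])    # list of generated strings
>>> print(output["token_ids"])    # corresponding token ids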

on_load_checkpoint(checkpoint) None#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint

on_save_checkpoint(checkpoint) None#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-save-checkpoint

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter)#

We pass the dataloader iterator function to the micro-batch scheduler. The input batch to each micro-batch is fetched using the dataloader function in the micro-batch fwd function.

validation_step(dataloader_iter, dataloader_idx=0)#

Our dataloaders produce a micro-batch and then we fetch a number of microbatches depending on the global batch size and model parallel size from the dataloader to produce a list of microbatches. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.

class nemo.collections.nlp.models.language_modeling.megatron_bert_model.MegatronBertModel(*args: Any, **kwargs: Any)#

Bases: MegatronBaseModel

Megatron Bert pretraining. Model returns [batch, seq, hidden] shape

on_load_checkpoint(checkpoint) None#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint

on_save_checkpoint(checkpoint) None#

LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-save-checkpoint

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

class nemo.collections.nlp.models.language_modeling.megatron_bart_model.MegatronBARTModel(*args: Any, **kwargs: Any)#

Bases: MegatronT5Model

Megatron BART pretraining

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter)#

Our dataloaders produce a micro-batch, and then we fetch a number of microbatches from the dataloader, depending on the global batch size and model parallel size, to produce a list of microbatches. The batch should be a list of microbatches, and those microbatches should be on CPU. Microbatches are then moved to GPU during the pipeline. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.

validation_step(dataloader_iter)#

return_values - if given, returns a dictionary with given keys and corresponding values

class nemo.collections.nlp.models.language_modeling.megatron_retrieval_model.MegatronRetrievalModel(*args: Any, **kwargs: Any)#

Bases: MegatronBaseModel, TextGeneration

Megatron Retrieval enhanced language model

generate(
inputs: List[str] | torch.Tensor | List[dict],
length_params: LengthParam,
sampling_params: SamplingParam | None = None,
**args,
) OutputType#

Public method to generate text.

Parameters:
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of the 3 types:

    1. List of strings. Each element of the list is an input prompt; the model will apply the tokenizer to it.

      E.g. [‘sentence’, ‘sentence2’ … ]

    2. Tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length); it contains the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder.

      The generative model will skip the tokenization and padding step. context_lengths has shape (batch_size,) and indicates the length of the context tokens for each input sequence. E.g. (torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs is converted into input token embeddings for the model.

      E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary type which controls the sampling length.

    • max_length: int, The maximum length of the sequence to be generated.

    • min_length: int, The minimum length of the sequence to be generated.

    If None, max_length is set to 30, and min_length is set to None

  • sampling_params (SamplingParam) –

    a dictionary type which contains the parameters for text sampling. It has the following keys

    • use_greedy: bool, Whether to use greedy decoding; otherwise sampling is used.

    • top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.

    • top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

    • repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.

    • add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.

    • all_probs: bool, Whether to return the log probabilities for all tokens in the vocabulary.

    • compute_logprob: bool, A flag to compute the log probability of the input text; a special case of running inference. Defaults to False.

    • end_strings: List[str], Generation stops when one of these strings is generated.

    Defaults to None; if None, use_greedy is set to True.

Returns:

It generates the output in a dictionary type. It has the following keys,

  • sentences: List[str], output sentences

  • tokens: List[List[str]], output sentences broken into tokens

  • logprob: List[List[float]], log prob of generated tokens

  • full_logprob: List[List[float]], log prob of all the tokens in the vocab

  • token_ids: List[List[int]], output sentence token ids

  • offsets: List[List[int]], list of token start positions in the text

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

class nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model(*args: Any, **kwargs: Any)#

Bases: MegatronLMEncoderDecoderModel

Megatron T5 pretraining

complete(request: Dict)#

Autoregressively invokes the language model in inference mode.

Parameters:

request – Dictionary with the following fields: * prompt: a string of text for the model to complete. * tokens_to_generate: how many tokens to generate while doing prompt completion.

Returns:

A python dictionary with the following fields
  • prompt: original text of the prompt

  • tokenized_prompt: list of (str) tokens from prompt

  • completion: a python dictionary with the following subfields:
    • tokens: a list of triples (token, token_id, log_prob) comprising completion

    • text: completion text (as a single string)

Return type:

response
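
A minimal sketch of the request/response round trip (not part of the original docstring), assuming model is a restored MegatronT5Model; the prompt text and token budget are illustrative.

>>> request = {"prompt": "The capital of France is", "tokens_to_generate": 16}
>>> response = model.complete(request)
>>> print(response["completion"]["text"])   # generated continuation as a single string
>>> print(response["tokenized_prompt"])     # prompt broken into tokens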

decode(
tokens_enc,
enc_mask,
num_tokens_to_generate,
encoder_input=None,
tokenizer=None,
enc_output=None,
enc_output_attn_mask=None,
ignore_ids=[],
bos_id=None,
predicted_tokens_dec=None,
batch_data=None,
sampling_method: str = 'greedy-search',
sampling_kwargs: dict = {},
)#
Parameters:
  • tokens_enc – a tensor of shape [batch_size, seq_len] that contains the input tokens.

  • enc_mask – a tensor of shape [batch_size, seq_len] that contains the input tokens mask (1 for active, 0 for inactive).

  • num_tokens_to_generate – the max number of tokens to generate.

  • encoder_input – a tensor of shape [batch_size, seq_len, hidden_size] that contains the encoder hidden states (replaces tokens_enc if given).

  • tokenizer – a tokenizer object.

  • enc_output – a tensor of shape [batch_size, seq_len, hidden_size] that contains the encoder hidden states (replaces tokens_enc and encoder_input if given).

  • enc_output_attn_mask – a tensor of shape [batch_size, seq_len] that contains the encoder attention mask (replaces enc_mask if given).

  • ignore_ids – a list of token ids to ignore when sampling.

  • bos_id – the id of the beginning of sentence token. If None, will use tokenizer.bos_id unless explicitly set to something else.

  • predicted_tokens_dec – a tensor of shape [batch_size, seq_len] that contains the tokens that have already been decoded.

  • sampling_method – a sampling method to use in the decoding iterations. Currently supported methods are “beam-search”/”greedy-search”/”topkp-sampling”. The argument specifies the sampling function that takes in a tensor of logits [batch_size, vocab_size] and returns a tuple (tensor of log_probs [batch_size], tensor of sampled tokens_ids from logits [batch_size]). If the beam search is enabled, the sampling function returns tensors [batch_size, beam_size]

  • sampling_kwargs – dict with arguments to be passed to the sampling function. Please refer to the method get_sampling_token_fn to see which arguments are required for a chosen sampling_method.

Returns:

tuple of tensors [batch_size, seq_len + 1], [batch_size, seq_len] for the predicted tokens and their log probs. If sampling_method == ‘beam-search’ and keep_only_best_tokens is False, the shapes of the tensors are [batch_size, beam_size, seq_len + 1], [batch_size, beam_size, seq_len].
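
A hedged sketch of a greedy decode call (not part of the original docstring), assuming model is a restored MegatronT5Model whose tokenizer exposes text_to_ids, ids_to_text, and pad_id; shapes follow the parameter descriptions above and the prompt is illustrative.

>>> import torch
>>> ids = model.tokenizer.text_to_ids("translate English to German: Hello world")
>>> tokens_enc = torch.tensor([ids], device=model.device)      # [batch_size, seq_len]
>>> enc_mask = (tokens_enc != model.tokenizer.pad_id).long()   # 1 = active, 0 = inactive
>>> predicted_tokens, log_probs = model.decode(
...     tokens_enc=tokens_enc,
...     enc_mask=enc_mask,
...     num_tokens_to_generate=32,
...     sampling_method="greedy-search",
... )
>>> print(model.tokenizer.ids_to_text(predicted_tokens[0].tolist()))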

encode(
tokens_enc,
enc_mask,
encoder_input=None,
batch_data=None,
reconfigure_microbatch=True,
)#
Parameters:
  • tokens_enc – encoder input tokens

  • enc_mask – corresponding mask

  • encoder_input – encoder input (bypasses tokens); if given, tokens_enc can be None.

  • batch_data – passed directly to all hidden transformations and losses. Can be used to pass additional data like class label. Format is not defined and should match the expected format of the used hiddens modules.

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter)#

Our dataloaders produce a micro-batch, and then we fetch a number of microbatches from the dataloader, depending on the global batch size and model parallel size, to produce a list of microbatches. The batch should be a list of microbatches, and those microbatches should be on CPU. Microbatches are then moved to GPU during the pipeline. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.

validation_step(dataloader_iter)#

return_values - if given, returns a dictionary with given keys and corresponding values

Customization Model Classes#

class nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model.MegatronGPTSFTModel(*args: Any, **kwargs: Any)#

Bases: NLPAdapterModelMixin, MegatronGPTModel

Megatron GPT Supervised Fine-Tuning

generate(
inputs: List[str] | torch.Tensor | List[dict],
length_params: LengthParam,
sampling_params: SamplingParam | None = None,
*,
strategy: TextGenerationStrategy | None = None,
) OutputType#

Public method to generate text.

Parameters:
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of the 3 types:

    1. List of strings. Each element of the list is an input prompt; the model will apply the tokenizer to it.

      E.g. [‘sentence’, ‘sentence2’ … ]

    2. Tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length); it contains the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder.

      The generative model will skip the tokenization and padding step. context_lengths has shape (batch_size,) and indicates the length of the context tokens for each input sequence. E.g. (torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs is converted into input token embeddings for the model.

      E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary type which controls the sampling length.

    • max_length: int, The maximum length of the sequence to be generated.

    • min_length: int, The minimum length of the sequence to be generated.

    If None, max_length is set to 30, and min_length is set to None

  • sampling_params (SamplingParam) –

    a dictionary type which contains the parameters for text sampling. It has the following keys

    • use_greedy: bool, Whether to use greedy decoding; otherwise sampling is used.

    • top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.

    • top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

    • repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.

    • add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.

    • all_probs: bool, Whether to return the log probabilities for all tokens in the vocabulary.

    • compute_logprob: bool, A flag to compute the log probability of the input text; a special case of running inference. Defaults to False.

    • end_strings: List[str], Generation stops when one of these strings is generated.

    Defaults to None; if None, use_greedy is set to True.

Returns:

It generates the output in a dictionary type. It has the following keys,

  • sentences: List[str], output sentences

  • tokens: List[List[str]], output sentences broken into tokens

  • logprob: List[List[float]], log prob of generated tokens

  • full_logprob: List[List[float]], log prob of all the tokens in the vocab

  • token_ids: List[List[int]], output sentence token ids

  • offsets: List[List[int]], list of token start positions in the text

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter)#

We pass the dataloader iterator function to the micro-batch scheduler. The input batch to each micro-batch is fetched using the dataloader function in the micro-batch fwd function.

validation_step(dataloader_iter)#

Our dataloaders produce a micro-batch and then we fetch a number of microbatches depending on the global batch size and model parallel size from the dataloader to produce a list of microbatches. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.

class nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTAdapterLearningModel(*args: Any, **kwargs: Any)#

Bases: MegatronGPTBaseAdapterModel

MegatronGPTAdapterLearningModel is a model that combines a base model (GPTModel) with adapters. This class only supports the canonical adapter training described in Houlsby et al. (https://arxiv.org/pdf/1902.00751.pdf).

Two adapters are inserted into each Transformer layer in the base GPT model.

It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights will be saved and can be re-loaded and infused into the same GPT model for inference.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

generate(
inputs: List[str] | torch.Tensor | List[dict],
length_params: LengthParam,
sampling_params: SamplingParam | None = None,
batch_size: int | None = 1,
)#

Public method to generate text.

Parameters:
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of the 3 types:

    1. List of strings. Each element of the list is an input prompt; the model will apply the tokenizer to it.

      E.g. [‘sentence’, ‘sentence2’ … ]

    2. Tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length); it contains the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder.

      The generative model will skip the tokenization and padding step. context_lengths has shape (batch_size,) and indicates the length of the context tokens for each input sequence. E.g. (torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs is converted into input token embeddings for the model.

      E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary type which controls the sampling length.

    • max_length: int, The maximum length of the sequence to be generated.

    • min_length: int, The minimum length of the sequence to be generated.

    If None, max_length is set to 30, and min_length is set to None

  • sampling_params (SamplingParam) –

    a dictionary type which contains the parameters for text sampling. It has the following keys

    • use_greedy: bool, Whether to use greedy decoding; otherwise sampling is used.

    • top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.

    • top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

    • repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.

    • add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.

    • all_probs: bool, Whether to return the log probabilities for all tokens in the vocabulary.

    • compute_logprob: bool, A flag to compute the log probability of the input text; a special case of running inference. Defaults to False.

    • end_strings: List[str], Generation stops when one of these strings is generated.

    Defaults to None; if None, use_greedy is set to True.

Returns:

It generates the output in a dictionary type. It has the following keys,

  • sentences: List[str], output sentences

  • tokens: List[List[str]], output sentences broken into tokens

  • logprob: List[List[float]], log prob of generated tokens

  • full_logprob: List[List[float]], log prob of all the tokens in the vocab

  • token_ids: List[List[int]], output sentence token ids

  • offsets: List[List[int]], list of token start positions in the text

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

state_dict(
destination=None,
prefix=None,
keep_vars=False,
)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.

class nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTInfusedAdapterModel(*args: Any, **kwargs: Any)#

Bases: MegatronGPTBaseAdapterModel

MegatronGPTInfusedAdapterModel is a model that combines a base model (GPTModel) with an “Infused Adapter by Inhibiting and Amplifying Inner Activations”, known as IA3. This class supports the addition of IA3 into a transformer-based LM as described in Liu et al. (https://arxiv.org/pdf/2205.05638.pdf).

Three adapters are inserted into each Transformer layer in the base GPT model. Each adapter is essentially a vector that scales the key, value, or FFN hidden representations.

It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights will be saved and can be re-loaded and infused into the same GPT model for inference.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

generate(
inputs: List[str] | torch.Tensor | List[dict],
length_params: LengthParam,
sampling_params: SamplingParam | None = None,
batch_size: int | None = 1,
)#

Public method to generate text.

Parameters:
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of the 3 types:

    1. List of strings. Each element of the list is an input prompt; the model will apply the tokenizer to it.

      E.g. [‘sentence’, ‘sentence2’ … ]

    2. Tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length); it contains the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder.

      The generative model will skip the tokenization and padding step. context_lengths has shape (batch_size,) and indicates the length of the context tokens for each input sequence. E.g. (torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs is converted into input token embeddings for the model.

      E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary type which controls the sampling length.

    • max_length: int, The maximum length of the sequence to be generated.

    • min_length: int, The minimum length of the sequence to be generated.

    If None, max_length is set to 30, and min_length is set to None

  • sampling_params (SamplingParam) –

    a dictionary type which contains the parameters for text sampling. It has the following keys

    • use_greedy: bool, Whether to use greedy decoding; otherwise sampling is used.

    • top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.

    • top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

    • repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.

    • add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.

    • all_probs: bool, Whether to return the log probabilities for all tokens in the vocabulary.

    • compute_logprob: bool, A flag to compute the log probability of the input text; a special case of running inference. Defaults to False.

    • end_strings: List[str], Generation stops when one of these strings is generated.

    Defaults to None; if None, use_greedy is set to True.

Returns:

It generates the output in a dictionary type. It has the following keys,

  • sentences: List[str], output sentences

  • tokens: List[List[str]], output sentences broken into tokens

  • logprob: List[List[float]], log prob of generated tokens

  • full_logprob: List[List[float]], log prob of all the tokens in the vocab

  • token_ids: List[List[int]], output sentence token ids

  • offsets: List[List[int]], list of token start positions in the text

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

state_dict(
destination=None,
prefix=None,
keep_vars=False,
)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.

class nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model.MegatronGPTPromptLearningModel(*args: Any, **kwargs: Any)#

Bases: MegatronBasePromptLearningModel

Model class for prompt-tuning or p-tuning a pretrained Megatron GPT model.

Prompt tuning initializes virtual prompt embeddings directly from a copy of certain token embeddings from the pretrained GPT model’s vocabulary and directly tunes these embedding weights. The token embeddings used in initialization are specified by the user in the config file. The model can be prompt-tuned for multiple tasks at once. Virtual prompts are stored in a prompt table and can be added or deleted without disrupting virtual prompts for other tasks.

P-tuning initializes an LSTM encoder model that generates virtual prompt embeddings for every task. Each task shares the same encoder. After p-tuning is complete, the learned virtual prompts can be saved to the prompt table using add_ptuned_prompts_to_prompt_table(). Thus, if a user wants to add a new virtual prompt via p-tuning, they do not need to retrain on all previous tasks. This gives p-tuning the same task flexibility as prompt-tuning.

generate(
inputs: List[str] | torch.Tensor | List[dict],
length_params: LengthParam,
sampling_params: SamplingParam | None = None,
batch_size: int | None = 1,
)#

Public method to generate text.

Parameters:
  • inputs (Union[List[str], Tensor, List[dict]]) –

    Can be one of the 3 types:

    1. List of strings. Each element of the list is an input prompt; the model will apply the tokenizer to it.

      E.g. [‘sentence’, ‘sentence2’ … ]

    2. Tuple of PyTorch tensors (context_tokens, context_lengths). context_tokens has shape (batch_size, seq_length); it contains the batched sequences of tokens used as a prompt for generation or as model inputs to the encoder.

      The generative model will skip the tokenization and padding step. context_lengths has shape (batch_size,) and indicates the length of the context tokens for each input sequence. E.g. (torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))

    3. List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs is converted into input token embeddings for the model.

      E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.

  • length_params (LengthParam) –

    a dictionary type which controls the sampling length.

    • max_length: int, The maximum length of the sequence to be generated.

    • min_length: int, The minimum length of the sequence to be generated.

    If None, max_length is set to 30, and min_length is set to None

  • sampling_params (SamplingParam) –

    a dictionary type which contains the parameters for text sampling. It has the following keys

    • use_greedy: bool, Whether to use greedy decoding; otherwise sampling is used.

    • top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.

    • top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

    • repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.

    • add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.

    • all_probs: bool, Whether to return the log probabilities for all tokens in the vocabulary.

    • compute_logprob: bool, A flag to compute the log probability of the input text; a special case of running inference. Defaults to False.

    • end_strings: List[str], Generation stops when one of these strings is generated.

    Defaults to None; if None, use_greedy is set to True.

Returns:

It generates the output in a dictionary type. It has the following keys,

  • sentences: List[str], output sentences

  • tokens: List[List[str]], output sentences broken into tokens

  • logprob: List[List[float]], log prob of generated tokens

  • full_logprob: List[List[float]], log prob of all the tokens in the vocab

  • token_ids: List[List[int]], output sentence token ids

  • offsets: List[List[int]], list of token start positions in the text

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

class nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5AdapterLearningModel(*args: Any, **kwargs: Any)#

Bases: MegatronT5BaseAdapterModel

TODO (@adithyare)

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

state_dict(
destination=None,
prefix=None,
keep_vars=False,
)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base model.

class nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5InfusedAdapterModel(*args: Any, **kwargs: Any)#

Bases: MegatronT5BaseAdapterModel

MegatronT5InfusedAdapterModel is a model that combines a base model (T5 model) with an “Infused Adapter by Inhibiting and Amplifying Inner Activations”, known as IA3. This class supports the addition of IA3 into a transformer-based LM as described in Liu et al. (https://arxiv.org/pdf/2205.05638.pdf).

Three adapters are inserted into each Transformer layer in the base model. Each adapter is essentially a vector that scales the key, value, or FFN hidden representations.

It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights will be saved and can be re-loaded and infused into the same base model for inference.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

state_dict(
destination=None,
prefix=None,
keep_vars=False,
)#

Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base model.

Modules#

class nemo.collections.nlp.modules.common.megatron.module.MegatronModule(*args: Any, **kwargs: Any)#

Bases: Module

Megatron specific extensions of torch Module with support for pipelining.

class nemo.collections.nlp.modules.common.megatron.module.Float16Module(*args: Any, **kwargs: Any)#

Bases: MegatronModule

class nemo.collections.nlp.models.language_modeling.megatron.gpt_model.GPTModel(*args: Any, **kwargs: Any)#

Bases: MegatronModule

GPT-2 Language model.

class nemo.collections.nlp.models.language_modeling.megatron.bert.bert_model.NeMoBertModel(*args: Any, **kwargs: Any)#

Bases: MegatronModule

Bert Language model. Model returns [seq, batch, hidden] shape

class nemo.collections.nlp.modules.common.megatron.token_level_encoder_decoder.MegatronTokenLevelEncoderDecoderModule(*args: Any, **kwargs: Any)#

Bases: MegatronModule, AdapterModuleMixin

Token-based (input/output is tokens) encoder-decoder model (e.g. T5 Language model.)

forward(
enc_input_ids=None,
enc_attn_mask=None,
dec_input_ids=None,
dec_attn_mask=None,
token_type_ids=None,
labels=None,
batch_data=None,
enc_output=None,
enc_output_attn_mask=None,
enc_input=None,
output_enc_hidden_only=False,
)#

Return value is per token / per dimension (i.e., the non-collapsed loss value).

class nemo.collections.nlp.modules.common.megatron.retrieval_token_level_encoder_decoder.MegatronRetrievalTokenLevelEncoderDecoderModule(
*args: Any,
**kwargs: Any,
)#

Bases: MegatronModule

Token-based (input/output is tokens) retrieval encoder-decoder model

forward(
input_ids,
input_attn_mask,
retrieved_ids,
retrieved_attn_mask,
token_type_ids=None,
labels=None,
input_emb=None,
set_inference_key_value_memory=False,
inference_max_sequence_len=None,
neighbors=None,
position_ids=None,
)#

Return value is per token / per dimension (i.e., the non-collapsed loss value).

Datasets#

class nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset.BlendableDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.gpt_dataset.GPTDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.gpt_dataset.MockGPTDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.bert_dataset.BertDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.base_prompt_learning_dataset.BasePromptLearningDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

The base dataset class for prompt-tuning or p-tuning. TODO: (@adithyare) should be merged into GPTPromptLearningDataset

class nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset.GPTSFTDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset.GPTSFTChatDataset(*args: Any, **kwargs: Any)#

Bases: GPTSFTDataset

class nemo.collections.nlp.data.language_modeling.megatron.retro_dataset.RETRODataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.t5_dataset.T5Dataset(*args: Any, **kwargs: Any)#

Bases: Dataset

class nemo.collections.nlp.data.language_modeling.megatron.t5_prompt_learning_dataset.T5PromptLearningDataset(*args: Any, **kwargs: Any)#

Bases: BasePromptLearningDataset

The dataset class for prompt-tuning or p-tuning pretrained T5 models.

class nemo.collections.nlp.data.language_modeling.megatron.ul2_dataset.UL2Dataset(*args: Any, **kwargs: Any)#

Bases: T5Dataset

UL2 Dataset from https://arxiv.org/abs/2205.05131. Consists of three different objectives:

  1. Short span masking with small probabilities (ex: T5). Typically max ngram size of 5 with 0.15 mask prob.

  2. Extreme span masking with either large probabilities or large ngram sizes or both.

  3. Prefix-LM as in the T5 or LM-adapted T5 (prompt-tuning paper).

Adapter Mixin Class#

class nemo.collections.nlp.parts.mixins.nlp_adapter_mixins.NLPAdapterModelMixin(*args, **kwargs)#

Bases: object

NLP Adapter Mixin that can augment any transformer-based model with Adapter module support. This mixin class should be used only with a top level ModelPT subclass, that includes either a model or an enc_dec_model submodule. This mixin class adds several utility methods to add, load and save adapters.

An Adapter module is any PyTorch nn.Module that possesses a few properties:

  • Its input and output dimensions are the same, while the hidden dimension need not be the same.

  • The final layer of the Adapter module is zero-initialized, so that the residual connection to the adapter yields the original output.

This mixin class aims to integrate with PEFT, which uses one or more adapter modules. The two features of PEFT, layer selection and weight tying, are also supported in this mixin class.

add_adapter(
peft_cfgs: PEFTConfig | List[PEFTConfig],
)#

High-level API to add one or more adapter modules to the model and freeze the base weights. This method accepts a single PEFTConfig or a list of PEFTConfigs and adds the corresponding adapter modules. Layer selection and weight tying are applied if specified in the PEFTConfig.

Parameters:

peft_cfgs – One or more PEFTConfig objects that specify the PEFT method configuration
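
A hedged usage sketch (not part of the original docstring), assuming model is a MegatronGPTSFTModel (or another subclass using this mixin) whose config contains a peft section, and that a LoRA-style PEFTConfig class is available; the exact config class and required fields depend on the installed NeMo version.

>>> from nemo.collections.nlp.parts.peft_config import LoraPEFTConfig
>>> peft_cfg = LoraPEFTConfig(model.cfg)   # assumes model.cfg carries a peft.lora_tuning section
>>> model.add_adapter(peft_cfg)            # inserts adapter modules and freezes the base weights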

load_adapters(
filepath: str,
peft_cfgs: PEFTConfig | List[PEFTConfig] | None = None,
map_location: str | None = None,
)#

Utility method that restores only the adapter module(s), and not the entire model itself. This allows the sharing of adapters which are often just a fraction of the size of the full model, enabling easier delivery.

Note

During restoration, this method assumes that the model does not already have adapter modules.

Parameters:
  • filepath – Filepath of the .ckpt or .nemo file.

  • peft_cfgs – One or more PEFTConfig objects that specify the PEFT method configuration. If None, the configuration is inferred from the .nemo checkpoint.

  • map_location – Pytorch flag, where to place the adapter(s) state dict(s).
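
A hedged sketch of restoring previously trained adapter weights (not part of the original docstring), assuming model is a freshly restored base model without adapters; the checkpoint path is illustrative.

>>> model.load_adapters("adapters.nemo")   # PEFT config inferred from the .nemo file
>>> # or pin the adapter type and device placement explicitly:
>>> # model.load_adapters("adapters.nemo", peft_cfgs=peft_cfg, map_location="cpu")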

classmethod merge_cfg_with(
path: str,
cfg: omegaconf.DictConfig,
) omegaconf.DictConfig#

Merge a given configuration dictionary cfg with the configuration dictionary obtained from restoring a MegatronGPTSFTModel or MegatronT5SFTModel at the specified path.

Parameters:
  • path (str) – The path to the SFT model checkpoint to be restored.

  • cfg (DictConfig) – The configuration dictionary to merge.

Returns:

The merged configuration dictionary.

Return type:

DictConfig

Examples

>>> path = "/path/to/model/checkpoint"
>>> cfg = DictConfig({"model": {"key": "value"}, "trainer": {"precision": 16}})
>>> merged_cfg = merge_cfg_with(path, cfg)

Notes

  • The function resolves variables within the cfg dictionary using OmegaConf.resolve.

  • Keys in cfg.model will override the corresponding keys in the output dictionary.

  • If “train_ds” exists in cfg.model.data, it updates micro_batch_size and global_batch_size.

  • If cfg.trainer contains a “precision” key, it updates output.precision.

classmethod merge_inference_cfg(
path: str,
cfg: omegaconf.DictConfig,
) omegaconf.DictConfig#

Generate a configuration dictionary by merging a given configuration dictionary cfg with the configuration dictionary obtained from restoring a MegatronGPTSFTModel or MegatronT5SFTModel at the specified path, and modify cfg for inference.

Parameters:
  • path (str) – The path to the SFT model checkpoint to be restored.

  • cfg (DictConfig) – The configuration dictionary to modify for inference.

Returns:

The configuration dictionary for inference.

Return type:

DictConfig

Examples

>>> path = "/path/to/model/checkpoint"
>>> cfg = DictConfig({"model": {"key": "value"}, "trainer": {"precision": 16}})
>>> merged_cfg = merge_inference_cfg(path, cfg)

Notes

  • “precision” and “test_ds” from cfg will override the corresponding keys in the output dictionary

  • “activations_checkpoint” will be overridden to None in the output dictionary.

  • “use_flash_attention” will be True if it is True in either of the configuration dictionaries.

  • “seq_len_interpolation_factor” will be overridden from cfg if it is not None in the checkpoint.

Exportable Model Classes#

class nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTExportableModel(*args: Any, **kwargs: Any)#

Bases: Module, Exportable

Megatron GPT Wrapper for ONNX export