Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
NeMo Large Language Model API
Pretraining Model Classes
- class nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.nlp_model.NLPModel
Megatron base class. All NeMo Megatron models inherit from this class.
Initialize the model parallel world for NeMo.
Turn on all of the NVIDIA optimizations.
If cfg.tokenizer is available, load the tokenizer and pad the vocab to the correct size for tensor model parallelism.
If using the distributed optimizer, configure it to be compatible with O2 level optimizations and/or model parallelism.
Perform gradient clipping: grad_clip_pl_default triggers the PyTorch Lightning default implementation, with_distributed_adam triggers the distributed optimizer’s implementation, megatron_amp_O2 triggers gradient clipping on the main grads, and otherwise gradient clipping is performed on the model grads.
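As a rough illustration of the vocabulary padding step listed above, the sketch below rounds a vocabulary size up to a multiple of a divisibility factor times the tensor model parallel size, so that every tensor-parallel rank can hold an equally sized slice of the embedding table. The function name and the default factor of 128 are assumptions made for this example and are not taken from the NeMo source.

```python
def pad_vocab_size(vocab_size: int, tensor_model_parallel_size: int, divisible_by: int = 128) -> int:
    """Round vocab_size up to a multiple of divisible_by * tensor_model_parallel_size.

    Hypothetical helper: the name and the default factor are illustrative only.
    """
    multiple = divisible_by * tensor_model_parallel_size
    # Ceiling division, then scale back up to the nearest multiple.
    return ((vocab_size + multiple - 1) // multiple) * multiple


padded = pad_vocab_size(50257, tensor_model_parallel_size=2)
print(padded)  # 50432
```

With a padded size that is divisible by the tensor model parallel size, each rank receives the same number of embedding rows.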
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer, no_lm_init=True)
Base class from which all NeMo models should inherit
- Parameters
cfg (DictConfig) –
configuration object. The cfg object should have (optionally) the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – PyTorch Lightning Trainer instance
- class nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel, nemo.collections.nlp.modules.common.transformer.text_generation.TextGeneration
Megatron GPT pretraining
- generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: nemo.collections.nlp.modules.common.transformer.text_generation.LengthParam, sampling_params: Optional[nemo.collections.nlp.modules.common.transformer.text_generation.SamplingParam] = None, *, strategy: Optional[nemo.collections.nlp.modules.common.text_generation_strategy.TextGenerationStrategy] = None) nemo.collections.nlp.modules.common.transformer.text_generation.OutputType
Public method to generate text.
- Parameters
inputs (Union[List[str], Tensor, List[dict]]) –
Can be one of the 3 types:
- List of strings. Each element of the list is an input prompt. The model will apply the tokenizer to it.
E.g. [‘sentence’, ‘sentence2’, …]
- Tuple of PyTorch Tensors (context_tokens, context_lengths). The context_tokens has shape (batch_size, seq_length); it holds the batched sequences of tokens used as prompts for the generation or as model inputs to the encoder.
The generative model will skip the tokenization and padding step. The context_lengths has shape (batch_size,), it indicates the length of the context tokens for each of the input sequences. E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))
- List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs are converted into input token embeddings for the model.
E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.
length_params (LengthParam) –
a dictionary type which controls the sampling length.
max_length: int, The maximum length of the sequence to be generated.
min_length: int, The minimum length of the sequence to be generated.
If None, max_length is set to 30, and min_length is set to None
sampling_params (SamplingParam) –
a dictionary type which contains the parameters for text sampling. It has the following keys:
use_greedy: bool, Whether to use greedy decoding; sampling is used otherwise.
top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.
add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.
all_probs: bool, Whether to return the log prob for all the tokens in the vocab.
compute_logprob: bool, A flag used to compute the log prob of all the input text, a very special case of running inference. Defaults to False.
end_strings: List[str], Generation will stop when one of these strings is generated.
Defaults to None. If None, use_greedy is set to True.
- Returns
The output is a dictionary with the following keys:
sentences: List[str], output sentences
tokens: List[List[str]], output sentences broken into tokens
logprob: List[List[float]], log prob of generated tokens
full_logprob: List[List[float]], log prob of all the tokens in the vocab
token_ids: List[List[int]], output sentence token ids
offsets: List[List[int]], start positions of the tokens in the text
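The fallback behaviour described for length_params and sampling_params can be sketched as below. This is a minimal illustration that uses plain dictionaries with the documented keys; the helper name resolve_generation_params is hypothetical and is not part of NeMo, and only the fallbacks stated in the text above are applied.

```python
from typing import Optional, Tuple


def resolve_generation_params(
    length_params: Optional[dict] = None,
    sampling_params: Optional[dict] = None,
) -> Tuple[dict, dict]:
    """Hypothetical helper: apply the documented fallbacks when an argument is None."""
    if length_params is None:
        # Documented fallback: max_length is 30 and min_length is None.
        length_params = {"max_length": 30, "min_length": None}
    if sampling_params is None:
        # Documented fallback: greedy decoding.
        sampling_params = {"use_greedy": True}
    return length_params, sampling_params


length, sampling = resolve_generation_params(
    sampling_params={"use_greedy": False, "top_k": 50, "top_p": 0.9, "repetition_penalty": 1.2}
)
print(length)
print(sampling["top_k"])
```

In real use, the two dictionaries would be passed as the length_params and sampling_params arguments of generate.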
- on_load_checkpoint(checkpoint) None
LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint
- on_save_checkpoint(checkpoint) None
LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-save-checkpoint
- setup(stage=None)
PTL hook that is executed after DDP spawns. We set up datasets here because Megatron datasets require DDP in order to be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- training_step(dataloader_iter)
We pass the dataloader iterator function to the micro-batch scheduler. The input batch to each micro-batch is fetched using the dataloader function in the micro-batch fwd function.
- validation_step(dataloader_iter, dataloader_idx=0)
Our dataloaders produce a micro-batch and then we fetch a number of microbatches depending on the global batch size and model parallel size from the dataloader to produce a list of microbatches. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.
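How many micro-batches are fetched per step follows from the global batch size, the micro-batch size, and the parallel layout. The sketch below assumes the usual Megatron-style relation, in which the data parallel size is the world size divided by the product of the tensor and pipeline model parallel sizes; the function name is hypothetical and the code is not taken from NeMo.

```python
def num_microbatches(
    global_batch_size: int,
    micro_batch_size: int,
    world_size: int,
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
) -> int:
    """Hypothetical helper: micro-batches each data-parallel replica processes per step."""
    model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by the model parallel size")
    data_parallel_size = world_size // model_parallel_size
    # Samples consumed when every replica processes one micro-batch.
    samples_per_round = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_round != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * data_parallel_size")
    return global_batch_size // samples_per_round


print(num_microbatches(256, 4, world_size=16,
                       tensor_model_parallel_size=2,
                       pipeline_model_parallel_size=2))  # 16
```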
- class nemo.collections.nlp.models.language_modeling.megatron_bert_model.MegatronBertModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel
Megatron BERT pretraining. The model returns a tensor of shape [batch, seq, hidden].
- on_load_checkpoint(checkpoint) None
LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint
- on_save_checkpoint(checkpoint) None
LightningModule hook: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-save-checkpoint
- setup(stage=None)
PTL hook that is executed after DDP spawns. We set up datasets here because Megatron datasets require DDP in order to be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- class nemo.collections.nlp.models.language_modeling.megatron_bart_model.MegatronBARTModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model
Megatron BART pretraining
- setup(stage=None)
PTL hook that is executed after DDP spawns. We set up datasets here because Megatron datasets require DDP in order to be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- training_step(dataloader_iter)
Our dataloaders produce a micro-batch and then we fetch a number of microbatches depending on the global batch size and model parallel size from the dataloader to produce a list of microbatches. Batch should be a list of microbatches and those microbatches should be on CPU. Microbatches are then moved to GPU during the pipeline. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.
- validation_step(dataloader_iter)
return_values - if given, returns a dictionary with given keys and corresponding values
- class nemo.collections.nlp.models.language_modeling.megatron_retrieval_model.MegatronRetrievalModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel, nemo.collections.nlp.modules.common.transformer.text_generation.TextGeneration
Megatron Retrieval enhanced language model
- generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: nemo.collections.nlp.modules.common.transformer.text_generation.LengthParam, sampling_params: Optional[nemo.collections.nlp.modules.common.transformer.text_generation.SamplingParam] = None, **args) nemo.collections.nlp.modules.common.transformer.text_generation.OutputType
Public method to generate text.
- Parameters
inputs (Union[List[str], Tensor, List[dict]]) –
Can be one of the 3 types:
- List of strings. Each element of the list is an input prompt. The model will apply the tokenizer to it.
E.g. [‘sentence’, ‘sentence2’, …]
- Tuple of PyTorch Tensors (context_tokens, context_lengths). The context_tokens has shape (batch_size, seq_length); it holds the batched sequences of tokens used as prompts for the generation or as model inputs to the encoder.
The generative model will skip the tokenization and padding step. The context_lengths has shape (batch_size,), it indicates the length of the context tokens for each of the input sequences. E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))
- List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs are converted into input token embeddings for the model.
E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.
length_params (LengthParam) –
a dictionary type which controls the sampling length.
max_length: int, The maximum length of the sequence to be generated.
min_length: int, The minimum length of the sequence to be generated.
If None, max_length is set to 30, and min_length is set to None
sampling_params (SamplingParam) –
a dictionary type which contains the parameters for text sampling. It has the following keys:
use_greedy: bool, Whether to use greedy decoding; sampling is used otherwise.
top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.
add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.
all_probs: bool, Whether to return the log prob for all the tokens in the vocab.
compute_logprob: bool, A flag used to compute the log prob of all the input text, a very special case of running inference. Defaults to False.
end_strings: List[str], Generation will stop when one of these strings is generated.
Defaults to None. If None, use_greedy is set to True.
- Returns
The output is a dictionary with the following keys:
sentences: List[str], output sentences
tokens: List[List[str]], output sentences broken into tokens
logprob: List[List[float]], log prob of generated tokens
full_logprob: List[List[float]], log prob of all the tokens in the vocab
token_ids: List[List[int]], output sentence token ids
offsets: List[List[int]], start positions of the tokens in the text
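To make the top_k and top_p sampling parameters concrete, the following sketch filters a small probability distribution in plain Python. It is a simplified illustration and not the NeMo implementation: the function name is hypothetical, the input is already a list of probabilities rather than logits, and the cumulative mass for top_p is measured on the original probabilities without renormalizing after the top-k cut.

```python
from typing import List


def filter_top_k_top_p(probs: List[float], top_k: int = 0, top_p: float = 1.0) -> List[int]:
    """Hypothetical helper: indices of tokens that survive top-k, then top-p filtering."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        # Keep only the top_k most probable tokens.
        ranked = ranked[:top_k]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for idx in ranked:
            kept.append(idx)
            cumulative += probs[idx]
            # Stop once the kept tokens cover at least top_p of the mass.
            if cumulative >= top_p:
                break
        ranked = kept
    return ranked


probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(filter_top_k_top_p(probs, top_k=3))    # [0, 1, 2]
print(filter_top_k_top_p(probs, top_p=0.7))  # [0, 1]
```

The next token would then be sampled from the surviving indices only.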
- setup(stage=None)
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters
stage – fit, validate, test or predict
- class nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_lm_encoder_decoder_model.MegatronLMEncoderDecoderModel
Megatron T5 pretraining
- complete(request: Dict)
Autoregressively invokes the language model in inference mode.
- Parameters
request – Dictionary with the following fields:
prompt: the text that the model should complete.
tokens_to_generate: how many tokens to generate while doing prompt completion.
- Returns
- A python dictionary with the following fields
prompt: original text of the prompt
tokenized_prompt: list of (str) tokens from prompt
- completion: a python dictionary with the following subfields:
tokens: a list of triples (token, token_id, log_prob) comprising completion
text: completion text (as a single string)
- Return type
response
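The request and response layouts described above can be pictured with plain dictionaries. In the sketch below the response is written by hand to show the structure only; the tokens, token ids, and log probs are made-up placeholder values, and completion_log_prob is a hypothetical helper rather than part of NeMo.

```python
# Request with the two documented fields.
request = {"prompt": "translate English to German: hello", "tokens_to_generate": 3}

# Hand-written stand-in for a response; every value here is a placeholder.
response = {
    "prompt": request["prompt"],
    "tokenized_prompt": ["translate", "English", "to", "German", ":", "hello"],
    "completion": {
        # Each entry is a (token, token_id, log_prob) triple.
        "tokens": [("hal", 101, -0.5), ("lo", 102, -0.25)],
        "text": "hallo",
    },
}


def completion_log_prob(response: dict) -> float:
    """Hypothetical helper: sum the log probs of the completion tokens."""
    return sum(log_prob for _token, _token_id, log_prob in response["completion"]["tokens"])


print(completion_log_prob(response))  # -0.75
```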
- decode(tokens_enc, enc_mask, num_tokens_to_generate, encoder_input=None, tokenizer=None, enc_output=None, enc_output_attn_mask=None, ignore_ids=[], bos_id=None, predicted_tokens_dec=None, batch_data=None, sampling_method: str = 'greedy-search', sampling_kwargs: dict = {})
- Parameters
tokens_enc – a tensor of shape [batch_size, seq_len] that contains the input tokens.
enc_mask – a tensor of shape [batch_size, seq_len] that contains the input tokens mask (1 for active, 0 for inactive).
num_tokens_to_generate – the max number of tokens to generate.
encoder_input – a tensor of shape [batch_size, seq_len, hidden_size] that contains the encoder hidden states (replaces tokens_enc if given).
tokenizer – a tokenizer object.
enc_output – a tensor of shape [batch_size, seq_len, hidden_size] that contains the encoder hidden states (replaces tokens_enc and encoder_input if given).
enc_output_attn_mask – a tensor of shape [batch_size, seq_len] that contains the encoder attention mask (replaces enc_mask if given).
ignore_ids – a list of token ids to ignore when sampling.
bos_id – the id of the beginning of sentence token. If None, will use tokenizer.bos_id unless explicitly set to something else.
predicted_tokens_dec – a tensor of shape [batch_size, seq_len] that contains the tokens that have already been decoded.
sampling_method – a sampling method to use in the decoding iterations. Currently supported methods are “beam-search”, “greedy-search”, and “topkp-sampling”. The argument specifies the sampling function that takes in a tensor of logits [batch_size, vocab_size] and returns a tuple (tensor of log_probs [batch_size], tensor of sampled token ids from logits [batch_size]). If beam search is enabled, the sampling function returns tensors [batch_size, beam_size].
sampling_kwargs – dict with arguments to be passed to the sampling function. Please refer to the method get_sampling_token_fn to see which arguments are required for a chosen sampling_method.
- Returns
tuple of tensors [batch_size, seq_len + 1], [batch_size, seq_len] for predicted tokens and their log probs. If sampling_method == ‘beam-search’ and keep_only_best_tokens is False, the shapes of the tensors are [batch_size, beam_size, seq_len + 1], [batch_size, beam_size, seq_len].
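The contract of a sampling function, as described for sampling_method, is to map logits of shape [batch_size, vocab_size] to a tuple of log probs and token ids, each of shape [batch_size]. A minimal greedy-search sketch of that contract is shown below; it uses nested Python lists in place of tensors, and greedy_search is a hypothetical name, not the function NeMo returns from get_sampling_token_fn.

```python
import math
from typing import List, Tuple


def greedy_search(logits: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Hypothetical sketch: pick the highest-scoring token per batch row.

    Returns (log_probs, token_ids), each with one entry per batch row.
    """
    log_probs, token_ids = [], []
    for row in logits:
        # Numerically stable log of the softmax normalizer.
        max_logit = max(row)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        best = max(range(len(row)), key=lambda i: row[i])
        token_ids.append(best)
        log_probs.append(row[best] - log_norm)
    return log_probs, token_ids


log_probs, token_ids = greedy_search([[0.0, 0.0], [1.0, 3.0, 2.0]])
print(token_ids)  # [0, 1]
```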
- encode(tokens_enc, enc_mask, encoder_input=None, batch_data=None, reconfigure_microbatch=True)
- Parameters
tokens_enc – encoder input tokens
enc_mask – corresponding mask
encoder_input – encoder input (bypass tokens), if given tokens_enc can be None.
batch_data – passed directly to all hidden transformations and losses. Can be used to pass additional data like class label. Format is not defined and should match the expected format of the used hiddens modules.
- setup(stage=None)
PTL hook that is executed after DDP spawns. We set up datasets here because Megatron datasets require DDP in order to be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- training_step(dataloader_iter)
Our dataloaders produce a micro-batch and then we fetch a number of microbatches depending on the global batch size and model parallel size from the dataloader to produce a list of microbatches. Batch should be a list of microbatches and those microbatches should be on CPU. Microbatches are then moved to GPU during the pipeline. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.
- validation_step(dataloader_iter)
return_values - if given, returns a dictionary with given keys and corresponding values
Customization Model Classes
- class nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model.MegatronGPTSFTModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.parts.mixins.nlp_adapter_mixins.NLPAdapterModelMixin, nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel
Megatron GPT Supervised Fine-Tuning
- generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: nemo.collections.nlp.modules.common.transformer.text_generation.LengthParam, sampling_params: Optional[nemo.collections.nlp.modules.common.transformer.text_generation.SamplingParam] = None, *, strategy: Optional[nemo.collections.nlp.modules.common.text_generation_strategy.TextGenerationStrategy] = None) nemo.collections.nlp.modules.common.transformer.text_generation.OutputType
Public method to generate text.
- Parameters
inputs (Union[List[str], Tensor, List[dict]]) –
Can be one of the 3 types:
- List of strings. Each element of the list is an input prompt. The model will apply the tokenizer to it.
E.g. [‘sentence’, ‘sentence2’, …]
- Tuple of PyTorch Tensors (context_tokens, context_lengths). The context_tokens has shape (batch_size, seq_length); it holds the batched sequences of tokens used as prompts for the generation or as model inputs to the encoder.
The generative model will skip the tokenization and padding step. The context_lengths has shape (batch_size,), it indicates the length of the context tokens for each of the input sequences. E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))
- List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs are converted into input token embeddings for the model.
E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.
length_params (LengthParam) –
a dictionary type which controls the sampling length.
max_length: int, The maximum length of the sequence to be generated.
min_length: int, The minimum length of the sequence to be generated.
If None, max_length is set to 30, and min_length is set to None
sampling_params (SamplingParam) –
a dictionary type which contains the parameters for text sampling. It has the following keys:
use_greedy: bool, Whether to use greedy decoding; sampling is used otherwise.
top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.
add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.
all_probs: bool, Whether to return the log prob for all the tokens in the vocab.
compute_logprob: bool, A flag used to compute the log prob of all the input text, a very special case of running inference. Defaults to False.
end_strings: List[str], Generation will stop when one of these strings is generated.
Defaults to None. If None, use_greedy is set to True.
- Returns
The output is a dictionary with the following keys:
sentences: List[str], output sentences
tokens: List[List[str]], output sentences broken into tokens
logprob: List[List[float]], log prob of generated tokens
full_logprob: List[List[float]], log prob of all the tokens in the vocab
token_ids: List[List[int]], output sentence token ids
offsets: List[List[int]], start positions of the tokens in the text
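When inputs is given in the (context_tokens, context_lengths) form, tokenization and padding are skipped, so the caller has to supply already padded token ids together with the true length of each sequence. The sketch below shows that preparation step with nested lists; pad_contexts and the pad id of 0 are assumptions made for this example, and in practice the two results would be wrapped in PyTorch tensors before being passed to generate.

```python
from typing import List, Tuple


def pad_contexts(token_id_lists: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Hypothetical helper: right-pad token id lists to a common length.

    Returns (context_tokens, context_lengths) as nested lists.
    """
    context_lengths = [len(ids) for ids in token_id_lists]
    max_len = max(context_lengths)
    # pad_id is an assumed placeholder; a real tokenizer defines its own pad token.
    context_tokens = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    return context_tokens, context_lengths


context_tokens, context_lengths = pad_contexts([[23, 5234, 23, 35], [223, 323]])
print(context_tokens)   # [[23, 5234, 23, 35], [223, 323, 0, 0]]
print(context_lengths)  # [4, 2]
```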
- setup(stage=None)
PTL hook that is executed after DDP spawns. We set up datasets here because Megatron datasets require DDP in order to be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- training_step(dataloader_iter)
We pass the dataloader iterator function to the micro-batch scheduler. The input batch to each micro-batch is fetched using the dataloader function in the micro-batch fwd function.
- validation_step(dataloader_iter)
Our dataloaders produce a micro-batch and then we fetch a number of microbatches depending on the global batch size and model parallel size from the dataloader to produce a list of microbatches. The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions.
- class nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTAdapterLearningModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTBaseAdapterModel
MegatronGPTAdapterLearningModel is a model that combines a base model (GPTModel) with adapters. This class only supports the canonical Adapter training described in Houlsby et al. (https://arxiv.org/pdf/1902.00751.pdf).
Two adapters are inserted into each Transformer layer in the base GPT Model.
It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights will be saved and can be re-loaded and infused into the same GPT Model for inference.
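For readers unfamiliar with the adapter design referenced above, a bottleneck adapter projects the hidden state down to a small dimension, applies a nonlinearity, projects back up, and adds the result to its input through a skip connection. The sketch below illustrates that computation with plain Python lists for a single hidden vector; the function names, the choice of ReLU, and the omission of biases are simplifications assumed for this example and do not mirror the NeMo modules.

```python
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter_forward(hidden: List[float], w_down: Matrix, w_up: Matrix) -> List[float]:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project, residual add."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


hidden = [1.0, -2.0, 3.0, 0.5]
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # 4 -> 2
w_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # 2 -> 4
print(adapter_forward(hidden, w_down, w_up))  # [2.0, -2.0, 3.0, 0.5]
```

If the up-projection is all zeros, the adapter reduces to the identity, which is why a trained adapter can be added to a frozen base model without disturbing it at initialization.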
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)
Base class from which all NeMo models should inherit
- Parameters
cfg (DictConfig) –
configuration object. The cfg object should have (optionally) the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – PyTorch Lightning Trainer instance
- generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: nemo.collections.nlp.modules.common.transformer.text_generation.LengthParam, sampling_params: Optional[nemo.collections.nlp.modules.common.transformer.text_generation.SamplingParam] = None, batch_size: Optional[int] = 1)
Public method to generate text.
- Parameters
inputs (Union[List[str], Tensor, List[dict]]) –
Can be one of the 3 types:
- List of strings. Each element of the list is an input prompt. The model will apply the tokenizer to it.
E.g. [‘sentence’, ‘sentence2’, …]
- Tuple of PyTorch Tensors (context_tokens, context_lengths). The context_tokens has shape (batch_size, seq_length); it holds the batched sequences of tokens used as prompts for the generation or as model inputs to the encoder.
The generative model will skip the tokenization and padding step. The context_lengths has shape (batch_size,), it indicates the length of the context tokens for each of the input sequences. E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))
- List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs are converted into input token embeddings for the model.
E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.
length_params (LengthParam) –
a dictionary type which controls the sampling length.
max_length: int, The maximum length of the sequence to be generated.
min_length: int, The minimum length of the sequence to be generated.
If None, max_length is set to 30, and min_length is set to None
sampling_params (SamplingParam) –
a dictionary type which contains the parameters for text sampling. It has the following keys:
use_greedy: bool, Whether to use greedy decoding; sampling is used otherwise.
top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.
add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.
all_probs: bool, Whether to return the log prob for all the tokens in the vocab.
compute_logprob: bool, A flag used to compute the log prob of all the input text, a very special case of running inference. Defaults to False.
end_strings: List[str], Generation will stop when one of these strings is generated.
Defaults to None. If None, use_greedy is set to True.
- Returns
The output is a dictionary with the following keys:
sentences: List[str], output sentences
tokens: List[List[str]], output sentences broken into tokens
logprob: List[List[float]], log prob of generated tokens
full_logprob: List[List[float]], log prob of all the tokens in the vocab
token_ids: List[List[int]], output sentence token ids
offsets: List[List[int]], start positions of the tokens in the text
- setup(stage=None)
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters
stage – fit, validate, test or predict
- state_dict(destination=None, prefix=None, keep_vars=False)
Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.
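The idea of checkpointing only the adapter weights can be sketched as a filter over a full state dictionary. The parameter names below and the "adapter" substring used to recognize adapter parameters are hypothetical; the actual keys depend on how the model registers its modules.

```python
from typing import Any, Dict


def adapter_only_state_dict(full_state: Dict[str, Any], marker: str = "adapter") -> Dict[str, Any]:
    """Hypothetical helper: keep only entries whose name contains the marker."""
    return {name: value for name, value in full_state.items() if marker in name}


# Made-up parameter names and values, for illustration only.
full_state = {
    "model.layers.0.attention.weight": [0.1, 0.2],
    "model.layers.0.adapter_1.linear_in.weight": [0.3],
    "model.layers.0.adapter_2.linear_out.weight": [0.4],
}
print(sorted(adapter_only_state_dict(full_state)))
```

Because the frozen base weights are left out, the resulting checkpoint is much smaller than a full model checkpoint.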
- class nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTInfusedAdapterModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_gpt_adapter_model.MegatronGPTBaseAdapterModel
MegatronGPTInfusedAdapterModel is a model that combines a base model (GPTModel) with an “Infused Adapter by Inhibiting and Amplifying Inner Activations”, known as IA3. This class supports the addition of IA3 into a transformer based LM as described in Liu et al. (https://arxiv.org/pdf/2205.05638.pdf).
Three adapters are inserted into each Transformer layer in the base GPT Model. Each adapter is basically a vector that simply scales the key, value or ffn hidden representations.
It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights will be saved and can be re-loaded and infused into the same GPT Model for inference.
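Since each IA3 adapter is a vector that scales a hidden representation, its forward pass amounts to an element-wise multiplication along the hidden dimension. The sketch below shows this for a small batch of activations using plain lists; ia3_scale is a hypothetical name, and the same operation would apply to the key, value, or ffn activations.

```python
from typing import List


def ia3_scale(activations: List[List[float]], scale: List[float]) -> List[List[float]]:
    """Hypothetical sketch: scale each hidden dimension by a learned factor."""
    return [[a * s for a, s in zip(row, scale)] for row in activations]


activations = [[1.0, 2.0], [3.0, 4.0]]
print(ia3_scale(activations, [1.0, 1.0]))  # [[1.0, 2.0], [3.0, 4.0]]
print(ia3_scale(activations, [0.5, 2.0]))  # [[0.5, 4.0], [1.5, 8.0]]
```

A scaling vector of all ones leaves the activations unchanged, while values below or above one inhibit or amplify the corresponding dimension.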
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)
Base class from which all NeMo models should inherit
- Parameters
cfg (DictConfig) –
configuration object. The cfg object should have (optionally) the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – PyTorch Lightning Trainer instance
- generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: nemo.collections.nlp.modules.common.transformer.text_generation.LengthParam, sampling_params: Optional[nemo.collections.nlp.modules.common.transformer.text_generation.SamplingParam] = None, batch_size: Optional[int] = 1)
Public method to generate text.
- Parameters
inputs (Union[List[str], Tensor, List[dict]]) –
Can be one of the 3 types:
- List of strings. Each element of the list is an input prompt. The model will apply the tokenizer to it.
E.g. [‘sentence’, ‘sentence2’, …]
- Tuple of PyTorch Tensors (context_tokens, context_lengths). The context_tokens has shape (batch_size, seq_length); it holds the batched sequences of tokens used as prompts for the generation or as model inputs to the encoder.
The generative model will skip the tokenization and padding step. The context_lengths has shape (batch_size,), it indicates the length of the context tokens for each of the input sequences. E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))
- List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs are converted into input token embeddings for the model.
E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.
length_params (LengthParam) –
a dictionary type which controls the sampling length.
max_length: int, The maximum length of the sequence to be generated.
min_length: int, The minimum length of the sequence to be generated.
If None, max_length is set to 30, and min_length is set to None
sampling_params (SamplingParam) –
a dictionary type which contains the parameters for text sampling. It has the following keys:
use_greedy: bool, Whether to use greedy decoding; sampling is used otherwise.
top_k: int, The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p: float, If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty: float, The parameter for repetition penalty. 1.0 means no penalty.
add_BOS: bool, Whether to add the BOS token at the beginning of the prompt.
all_probs: bool, Whether to return the log prob for all the tokens in the vocab.
compute_logprob: bool, A flag used to compute the log prob of all the input text, a very special case of running inference. Defaults to False.
end_strings: List[str], Generation will stop when one of these strings is generated.
Defaults to None. If None, use_greedy is set to True.
- Returns
The output is a dictionary with the following keys:
sentences: List[str], output sentences
tokens: List[List[str]], output sentences broken into tokens
logprob: List[List[float]], log prob of generated tokens
full_logprob: List[List[float]], log prob of all the tokens in the vocab
token_ids: List[List[int]], output sentence token ids
offsets: List[List[int]], start positions of the tokens in the text
- setup(stage=None)
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters
stage – fit, validate, test or predict
- state_dict(destination=None, prefix=None, keep_vars=False)
Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.
- class nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model.MegatronGPTPromptLearningModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_base_prompt_learning_model.MegatronBasePromptLearningModel
Model class for prompt-tuning or p-tuning a pretrained Megatron GPT model.
Prompt Tuning initializes virtual prompt embeddings directly from a copy of certain token embeddings from the pretrained GPT model’s vocabulary and directly tunes these embedding weights. The token embeddings used in initialization are specified by the user in the config file. The model can be prompt-tuned for multiple tasks at once. Virtual prompts are stored in a prompt table and can be added or deleted without disrupting virtual prompts for other tasks.
P-tuning initializes an LSTM encoder model that generates virtual prompt embeddings for every task. Each task shares the same encoder. After p-tuning is complete, the learned virtual prompts can be saved to the prompt table using add_ptuned_prompts_to_prompt_table(). Thus, if a user wants to add a new virtual prompt via p-tuning, they do not need to retrain on all previous tasks. This gives p-tuning the same task flexibility as prompt-tuning.
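The prompt-tuning initialization and the prompt table described above can be pictured with a small, self-contained sketch. Here the vocabulary embeddings are a tiny list of vectors, init_virtual_prompt is a hypothetical helper that copies the chosen token embeddings, and the prompt table is a plain dictionary keyed by task name; none of these names come from NeMo.

```python
from typing import Dict, List

Embedding = List[float]


def init_virtual_prompt(word_embeddings: List[Embedding], init_token_ids: List[int]) -> List[Embedding]:
    """Hypothetical helper: start virtual prompt embeddings from copies of token embeddings."""
    # Copy each row so tuning the virtual prompt leaves the vocabulary untouched.
    return [list(word_embeddings[token_id]) for token_id in init_token_ids]


word_embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]

# One entry per task; entries are independent of each other.
prompt_table: Dict[str, List[Embedding]] = {}
prompt_table["sentiment"] = init_virtual_prompt(word_embeddings, [2, 0])
prompt_table["qa"] = init_virtual_prompt(word_embeddings, [1])

prompt_table["sentiment"][0][0] += 0.5  # stand-in for a tuning update
del prompt_table["qa"]                  # removing one task leaves the others intact

print(prompt_table)        # {'sentiment': [[2.5, 2.1], [0.0, 0.1]]}
print(word_embeddings[2])  # [2.0, 2.1]
```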
- generate(inputs: Union[List[str], torch.Tensor, List[dict]], length_params: nemo.collections.nlp.modules.common.transformer.text_generation.LengthParam, sampling_params: Optional[nemo.collections.nlp.modules.common.transformer.text_generation.SamplingParam] = None, batch_size: Optional[int] = 1)
Public method to generate text.
- Parameters
inputs (Union[List[str], Tensor, List[dict]]) –
Can be one of the 3 types:
- List of strings. Each element of the list provides an input prompt. The model will apply the tokenizer to it.
E.g. [‘sentence’, ‘sentence2’ … ]
- Tuple of PyTorch Tensors (context_tokens, context_lengths). The context_tokens tensor has shape (batch_size, seq_length); it holds the batched sequences of tokens used as prompts for the generation or as model inputs to the encoder.
The generative model will skip the tokenization and padding step. The context_lengths tensor has shape (batch_size,); it indicates the length of the context tokens for each of the input sequences. E.g. ( torch.tensor([[23,5234,23,35,…], [223,323,23,23232,232,…] …]), torch.tensor([20, 30, …]))
- List of python dict objects. Used for prompt/p-tuning inputs where a set of key-value pairs are converted into input token embeddings for the model.
E.g. [{“prompt-tag”: “sentiment”, “sentence”: “this is a good movie”}, {“prompt-tag”: “qa”, “context”: “some context text”, “question”: “a simple question”} … ] where ‘prompt-tag’ is used to identify the type of NLP task to solve.
length_params (LengthParam) –
a dictionary type which controls the sampling length.
max_length: int, The maximum length of the sequence to be generated.
min_length: int, The minimum length of the sequence to be generated.
If None, max_length is set to 30, and min_length is set to None
sampling_params (SamplingParam) –
a dictionary type which contains the parameters for text sampling. It has the following keys:
use_greedy: bool, whether to use greedy decoding; sampling is used otherwise
top_k: int, the number of highest probability vocabulary tokens to keep for top-k filtering.
top_p: float, if set to a float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty: float, the parameter for repetition penalty. 1.0 means no penalty.
add_BOS: bool, whether to add the BOS token at the beginning of the prompt
all_probs: bool, whether to return the log prob for all the tokens in the vocab
compute_logprob: bool, a flag used to compute the logprob of all the input text, a very special case of running inference, default False
end_strings: List[str], generation will stop when one of these strings is generated
Defaults to None. If it is None, use_greedy will be “True”.
- Returns
The output is a dictionary with the following keys:
sentences: List[str], output sentences
tokens: List[List[str]], output sentences broken into tokens
logprob: List[List[float]], log prob of generated tokens
full_logprob: List[List[float]], log prob of all the tokens in the vocab
token_ids: List[List[int]], output sentence token ids
offsets: List[List[int]], start positions of the tokens in the text
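The use_greedy, top_k and top_p keys documented above can be illustrated with a small, library-free sketch. This is not the NeMo sampling code: the helper names are hypothetical, treating top_k = 0 as "disabled" is an assumption, and applying top-k before top-p is one possible ordering.

```python
def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return the token ids kept after top-k, then top-p, filtering."""
    # Token ids ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:  # assumption: 0 means "no top-k filtering"
        order = order[:top_k]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token_id in order:
            kept.append(token_id)
            cumulative += probs[token_id]
            if cumulative >= top_p:  # smallest set reaching top_p
                break
        order = kept
    return order


def candidate_ids(probs, sampling_params):
    """Greedy decoding keeps only the argmax; otherwise filter for sampling."""
    if sampling_params.get("use_greedy", True):
        return [max(range(len(probs)), key=lambda i: probs[i])]
    return filter_top_k_top_p(
        probs,
        top_k=sampling_params.get("top_k", 0),
        top_p=sampling_params.get("top_p", 1.0),
    )


probs = [0.5, 0.3, 0.15, 0.05]
sampling_params = {"use_greedy": False, "top_k": 3, "top_p": 0.7}
greedy_params = {"use_greedy": True}

sampled = candidate_ids(probs, sampling_params)
greedy = candidate_ids(probs, greedy_params)
print(sampled, greedy)
```

A sampler would then draw the next token from the kept ids after renormalizing their probabilities.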
- setup(stage=None)
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters
stage – fit, validate, test or predict
- class nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5AdapterLearningModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5BaseAdapterModel
TODO (@adithyare)
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)
Base class from which all NeMo models should inherit
- Parameters
cfg (DictConfig) –
configuration object. The cfg object should have (optionally) the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
- setup(stage=None)
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters
stage – fit, validate, test or predict
- state_dict(destination=None, prefix=None, keep_vars=False)
Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.
- class nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5InfusedAdapterModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.models.language_modeling.megatron_t5_adapter_model.MegatronT5BaseAdapterModel
MegatronGPTInfusedAdapterModel is a model that combines a base model (GPTModel) with an “Infused Adapter by Inhibiting and Amplifying Inner Activations”, known as IA3. This class supports the addition of IA3 into a transformer-based LM as described in Liu et al. (https://arxiv.org/pdf/2205.05638.pdf)
Three adapters are inserted into each Transformer layer in the base GPT Model. Each adapter is basically a vector that simply scales the key, value or ffn hidden representations.
It is assumed that this set of adapters will then be trained for a specific task. Once trained, the adapter weights will be saved and can be re-loaded and infused into the same GPT Model for inference.
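Since each IA3 adapter is a vector that scales a hidden representation, its effect can be sketched in a few lines. The snippet below is a toy illustration with plain lists rather than tensors; the function name ia3_scale and the numeric values are hypothetical, and starting from a vector of ones (so the base model is initially unchanged) follows the referenced paper rather than anything stated in this page.

```python
def ia3_scale(activations, scale):
    """Elementwise rescale of one hidden vector by a learned vector."""
    if len(activations) != len(scale):
        raise ValueError("scale must match the hidden dimension")
    return [a * s for a, s in zip(activations, scale)]


hidden = [1.0, -2.0, 0.5]      # e.g. a key, value or ffn activation
identity = [1.0, 1.0, 1.0]     # ones leave the base model output unchanged
learned = [0.0, 1.5, 2.0]      # hypothetical trained values

unchanged = ia3_scale(hidden, identity)
rescaled = ia3_scale(hidden, learned)
print(unchanged, rescaled)
```

A value below 1 inhibits a dimension and a value above 1 amplifies it, which is where the adapter gets its name.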
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)
Base class from which all NeMo models should inherit
- Parameters
cfg (DictConfig) –
configuration object. The cfg object should have (optionally) the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
- setup(stage=None)
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters
stage – fit, validate, test or predict
- state_dict(destination=None, prefix=None, keep_vars=False)
Creates a state_dict using only the adapter parameters. This ensures that this wrapper class will only checkpoint the adapter weights and not the rest of the base GPT Model.
Modules
- class nemo.collections.nlp.modules.common.megatron.module.MegatronModule(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
Megatron specific extensions of torch Module with support for pipelining.
- class nemo.collections.nlp.modules.common.megatron.module.Float16Module(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.modules.common.megatron.module.MegatronModule
- class nemo.collections.nlp.models.language_modeling.megatron.gpt_model.GPTModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.modules.common.megatron.module.MegatronModule
GPT-2 Language model.
- class nemo.collections.nlp.models.language_modeling.megatron.bert.bert_model.NeMoBertModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.modules.common.megatron.module.MegatronModule
BERT language model. The model returns output of shape [seq, batch, hidden].
- class nemo.collections.nlp.modules.common.megatron.token_level_encoder_decoder.MegatronTokenLevelEncoderDecoderModule(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.modules.common.megatron.module.MegatronModule
,nemo.core.classes.mixins.adapter_mixins.AdapterModuleMixin
Token-based (input/output is tokens) encoder-decoder model (e.g. T5 Language model.)
- forward(enc_input_ids=None, enc_attn_mask=None, dec_input_ids=None, dec_attn_mask=None, token_type_ids=None, labels=None, batch_data=None, enc_output=None, enc_output_attn_mask=None, enc_input=None, output_enc_hidden_only=False)
Return value is per token / per dimension (i.e., non collapsed loss value)
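A "non-collapsed" loss means that one value is returned per token instead of a single averaged scalar. The sketch below shows the difference using plain Python; it is not the module's forward pass, the helper per_token_nll is hypothetical, and the final averaging step is an assumption about what a caller might do with the per-token values.

```python
import math


def per_token_nll(logits, labels):
    """Return one negative log-likelihood per token (no reduction)."""
    losses = []
    for token_logits, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    return losses


# Two tokens over a vocabulary of size two.
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]

losses = per_token_nll(logits, labels)   # non-collapsed: one value per token
mean_loss = sum(losses) / len(losses)    # collapsing is left to the caller
print(losses, mean_loss)
```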
- class nemo.collections.nlp.modules.common.megatron.retrieval_token_level_encoder_decoder.MegatronRetrievalTokenLevelEncoderDecoderModule(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.modules.common.megatron.module.MegatronModule
Token-based (input/output is tokens) retrieval encoder-decoder model
- forward(input_ids, input_attn_mask, retrieved_ids, retrieved_attn_mask, token_type_ids=None, labels=None, input_emb=None, set_inference_key_value_memory=False, inference_max_sequence_len=None, neighbors=None, position_ids=None)
Return value is per token / per dimension (i.e., non collapsed loss value)
Datasets
- class nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset.BlendableDataset(*args: Any, **kwargs: Any)
Bases:
torch.utils.data.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.gpt_dataset.GPTDataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.gpt_dataset.MockGPTDataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.bert_dataset.BertDataset(*args: Any, **kwargs: Any)
Bases:
torch.utils.data.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.base_prompt_learning_dataset.BasePromptLearningDataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
The base dataset class for prompt-tuning or p-tuning. TODO: (@adithyare) should be merged into GPTPromptLearningDataset
- class nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset.GPTSFTDataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset.GPTSFTChatDataset(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset.GPTSFTDataset
- class nemo.collections.nlp.data.language_modeling.megatron.retro_dataset.RETRODataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.t5_dataset.T5Dataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
- class nemo.collections.nlp.data.language_modeling.megatron.t5_prompt_learning_dataset.T5PromptLearningDataset(*args: Any, **kwargs: Any)
The dataset class for prompt-tuning or p-tuning pretrained T5 models.
- class nemo.collections.nlp.data.language_modeling.megatron.ul2_dataset.UL2Dataset(*args: Any, **kwargs: Any)
Bases:
nemo.collections.nlp.data.language_modeling.megatron.t5_dataset.T5Dataset
UL2 Dataset from https://arxiv.org/abs/2205.05131. Consists of three different objectives:
Short span masking with small probabilities (ex: T5). Typically max ngram size of 5 with 0.15 mask prob.
Extreme span masking with either large probabilities or large ngram sizes or both.
Prefix-LM as in the T5 or LM-adapted T5 (prompt-tuning paper).
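The three objectives listed above differ mainly in which parts of the input are hidden. The following sketch shows the input/target shapes for each on a toy sentence. It is not the UL2Dataset code: the helper names are hypothetical, the spans are chosen by hand instead of being sampled, spans are assumed not to overlap, and the `<extra_id_N>` sentinel naming is borrowed from common T5 tokenizers as an assumption.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, length) span with a sentinel.

    Returns (inputs, targets) in the T5 span-corruption layout.
    Spans are assumed to be non-overlapping.
    """
    inputs, targets, cursor = [], [], 0
    for index, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{index}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    return inputs, targets


def prefix_lm_split(tokens, split):
    """Prefix-LM: the model reads a prefix and predicts the continuation."""
    return tokens[:split], tokens[split:]


tokens = "the quick brown fox jumps over the lazy dog".split()

# Short span masking: a few short spans.
short_in, short_out = corrupt_spans(tokens, [(1, 1), (6, 1)])
# Extreme span masking: one long span covering most of the sequence.
extreme_in, extreme_out = corrupt_spans(tokens, [(2, 6)])
# Prefix-LM: split at a position and predict everything after it.
prefix, continuation = prefix_lm_split(tokens, 4)

print(short_in, short_out)
print(extreme_in, extreme_out)
print(prefix, continuation)
```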
Adapter Mixin Class
- class nemo.collections.nlp.parts.mixins.nlp_adapter_mixins.NLPAdapterModelMixin(*args, **kwargs)
Bases:
object
NLP Adapter Mixin that can augment any transformer-based model with Adapter module support. This mixin class should be used only with a top level ModelPT subclass, that includes either a model or an enc_dec_model submodule. This mixin class adds several utility methods to add, load and save adapters.
An Adapter module is any PyTorch nn.Module that possesses a few properties:
Its input and output dimensions are the same, while the hidden dimension need not be the same.
The final layer of the Adapter module is zero-initialized, so that the residual connection to the adapter yields the original output.
This mixin class aims to integrate with PEFT, which consists of one or more adapter modules. The two features of PEFT, layer selection and weight tying, are also supported in this mixin class.
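The two adapter properties listed above (matching input and output dimensions, and a zero-initialized final layer) can be checked with a tiny numeric sketch. Plain lists stand in for tensors, the ReLU bottleneck shape is an assumption about a typical adapter, and the function names are hypothetical; this is not the NeMo adapter implementation.

```python
def linear(weights, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def adapter_forward(x, w_in, w_out):
    """Bottleneck adapter with a residual connection: x + W_out relu(W_in x)."""
    hidden = [max(0.0, h) for h in linear(w_in, x)]
    update = linear(w_out, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, -2.0, 3.0]                            # model dimension 3
w_in = [[0.5, 0.25, -0.5], [0.25, 0.0, 0.5]]    # 3 -> 2, hidden size differs
w_out_zero = [[0.0, 0.0] for _ in range(3)]     # 2 -> 3, zero-initialized
w_out_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical values

y_initial = adapter_forward(x, w_in, w_out_zero)
y_trained = adapter_forward(x, w_in, w_out_trained)
print(y_initial, y_trained)
```

With the final layer at zero the adapter contributes nothing, so adding it to a pretrained model does not change the model's output until training moves the weights.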
- add_adapter(peft_cfgs: Union[nemo.collections.nlp.parts.peft_config.PEFTConfig, List[nemo.collections.nlp.parts.peft_config.PEFTConfig]])
High-level API to add one or more adapter modules to the model and freeze the base weights. This method accepts a PEFTConfig or a list of PEFTConfig objects and adds the corresponding adapter modules. Layer selection and weight tying are applied if they are specified in the PEFTConfig.
- Parameters
peft_cfgs – One or more PEFTConfig objects that specify the PEFT method configuration
- load_adapters(filepath: str, peft_cfgs: Optional[Union[nemo.collections.nlp.parts.peft_config.PEFTConfig, List[nemo.collections.nlp.parts.peft_config.PEFTConfig]]] = None, map_location: Optional[str] = None)
Utility method that restores only the adapter module(s), and not the entire model itself. This allows the sharing of adapters which are often just a fraction of the size of the full model, enabling easier delivery.
Note
During restoration, it is assumed that the model does not already have one or more adapter modules.
- Parameters
filepath – Filepath of the .ckpt or .nemo file.
peft_cfgs – One or more PEFTConfig objects that specify the PEFT method configuration. If None, the configuration is inferred from the .nemo checkpoint.
map_location – Pytorch flag, where to place the adapter(s) state dict(s).
- classmethod merge_cfg_with(path: str, cfg: omegaconf.DictConfig) omegaconf.DictConfig
Merge a given configuration dictionary cfg with the configuration dictionary obtained from restoring a MegatronGPTSFTModel or MegatronT5SFTModel at the specified path.
- Parameters
path (str) – The path to the SFT model checkpoint to be restored.
cfg (DictConfig) – The configuration dictionary to merge.
- Returns
The merged configuration dictionary.
- Return type
DictConfig
Examples
>>> path = "/path/to/model/checkpoint"
>>> cfg = DictConfig({"model": {"key": "value"}, "trainer": {"precision": 16}})
>>> merged_cfg = merge_cfg_with(path, cfg)
Notes
The function resolves variables within the cfg dictionary using OmegaConf.resolve.
Keys in cfg.model will override the corresponding keys in the output dictionary.
If “train_ds” exists in cfg.model.data, it updates micro_batch_size and global_batch_size.
If cfg.trainer contains a “precision” key, it updates output.precision.
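The override rules in the notes above can be sketched with plain dictionaries. This is only an illustration and not the NeMo implementation: it uses dict objects instead of DictConfig, skips the OmegaConf.resolve step, the function name merge_model_overrides is hypothetical, and restored_cfg stands in for the configuration that would come from the restored checkpoint.

```python
import copy


def merge_model_overrides(restored_cfg, cfg):
    """Overlay cfg['model'] and the trainer precision onto a restored config."""
    output = copy.deepcopy(restored_cfg)
    model_cfg = cfg.get("model", {})
    # Keys in cfg.model override the corresponding keys in the output.
    for key, value in model_cfg.items():
        output[key] = copy.deepcopy(value)
    # If train_ds exists in cfg.model.data, batch sizes are taken from it.
    train_ds = model_cfg.get("data", {}).get("train_ds")
    if train_ds is not None:
        output["micro_batch_size"] = train_ds["micro_batch_size"]
        output["global_batch_size"] = train_ds["global_batch_size"]
    # If cfg.trainer has a precision key, it updates output.precision.
    if "precision" in cfg.get("trainer", {}):
        output["precision"] = cfg["trainer"]["precision"]
    return output


# Hypothetical values standing in for a restored checkpoint configuration.
restored_cfg = {
    "hidden_size": 1024,
    "precision": 32,
    "micro_batch_size": 1,
    "global_batch_size": 8,
}
cfg = {
    "model": {
        "key": "value",
        "data": {"train_ds": {"micro_batch_size": 4, "global_batch_size": 64}},
    },
    "trainer": {"precision": 16},
}

merged = merge_model_overrides(restored_cfg, cfg)
print(merged)
```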
- classmethod merge_inference_cfg(path: str, cfg: omegaconf.DictConfig) omegaconf.DictConfig
Generate a configuration dictionary for inference by merging a given configuration dictionary cfg with the configuration dictionary obtained from restoring a MegatronGPTSFTModel or MegatronT5SFTModel at the specified path.
- Parameters
path (str) – The path to the SFT model checkpoint to be restored.
cfg (DictConfig) – The configuration dictionary to modify for inference.
- Returns
The configuration dictionary for inference.
- Return type
DictConfig
Examples
>>> path = "/path/to/model/checkpoint"
>>> cfg = DictConfig({"model": {"key": "value"}, "trainer": {"precision": 16}})
>>> merged_cfg = merge_inference_cfg(path, cfg)
Notes
“precision” and “test_ds” from cfg will override the corresponding keys in the output dictionary.
“activations_checkpoint” will be overridden to None in the output dictionary.
“use_flash_attention” will be True if it is True in either of the configuration dictionaries.
“seq_len_interpolation_factor” will be overridden from cfg if it is not None in the checkpoint.
Exportable Model Classes
- class nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTExportableModel(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
,nemo.core.classes.exportable.Exportable
Megatron GPT Wrapper for ONNX export