NeMo NLP collection API

Model Classes

class nemo.collections.nlp.models.TextClassificationModel(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.models.nlp_model.NLPModel, nemo.core.classes.exportable.Exportable

classifytext(queries: List[str], batch_size: int = 1, max_seq_length: int = -1) List[int]

Get predictions for the given queries.

Parameters
  • queries – text sequences

  • batch_size – batch size to use during inference

  • max_seq_length – sequences longer than max_seq_length will be truncated; the default of -1 disables truncation.

Returns

model predictions

Return type

all_preds
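
A minimal inference sketch (the checkpoint path, queries, and printed class ids are illustrative, not part of the API):

    from nemo.collections.nlp.models import TextClassificationModel

    # Restore a trained model from a .nemo checkpoint (path is illustrative).
    model = TextClassificationModel.restore_from("text_classification.nemo")

    queries = ["This movie was great!", "Terrible service, would not recommend."]
    # Each element of the result is the integer id of the predicted class.
    predictions = model.classifytext(queries=queries, batch_size=2, max_seq_length=128)
    print(predictions)  # e.g. [1, 0]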

create_loss_module()[source]
forward(input_ids, token_type_ids, attention_mask)[source]

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]

Should list all pre-trained models available via the NVIDIA NGC cloud. Note: there is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in the returned list will be instantiated.

Returns

A list of PretrainedModelInfo entries
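
For example, the available checkpoints can be inspected before loading one (a sketch; the names printed depend on the installed NeMo version):

    from nemo.collections.nlp.models import TextClassificationModel

    # Each entry is a PretrainedModelInfo describing one NGC checkpoint.
    for model_info in TextClassificationModel.list_available_models() or []:
        print(model_info.pretrained_model_name)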

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[omegaconf.DictConfig])[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in training.

Parameters

train_data_config – training data layer parameters.
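
As a sketch, the data loader can be configured with an omegaconf dictionary; the keys shown (file_path, batch_size, shuffle) are assumptions and must match the model's dataset config schema:

    from omegaconf import OmegaConf
    from nemo.collections.nlp.models import TextClassificationModel

    model = TextClassificationModel.restore_from("text_classification.nemo")  # illustrative path
    train_config = OmegaConf.create({
        "file_path": "data/train.tsv",  # assumed key; verify against the model config
        "batch_size": 32,
        "shuffle": True,
    })
    model.setup_training_data(train_data_config=train_config)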

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_config – validation data layer parameters.

test_epoch_end(outputs)[source]

Called at the end of testing to aggregate outputs.

Parameters

outputs – list of individual outputs of each test step.

test_step(batch, batch_idx)[source]

Lightning calls this inside the test loop with the data from the test dataloader passed in as batch.

training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

validation_epoch_end(outputs)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

validation_step(batch, batch_idx)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

class nemo.collections.nlp.models.GLUEModel(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.models.nlp_model.NLPModel

forward(input_ids, token_type_ids, attention_mask)[source]
property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]

Should list all pre-trained models available via the NVIDIA NGC cloud. Note: there is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in the returned list will be instantiated.

Returns

A list of PretrainedModelInfo entries

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_multiple_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]] = None)[source]

(Optionally) sets up data loaders to be used in validation, with support for multiple data loaders.

Parameters

val_data_config – validation data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in training.

Parameters

train_data_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in validation.

Parameters

val_data_config – validation data layer parameters.

training_step(batch, batch_idx)[source]
update_data_dir(data_dir: str) None[source]

Updates the data directory and gets data stats with the Data Descriptor. The weights are later used to set up the loss.

Parameters

data_dir – path to data directory

validation_step(batch, batch_idx, dataloader_idx=0)[source]
class nemo.collections.nlp.models.PunctuationCapitalizationModel(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.models.nlp_model.NLPModel, nemo.core.classes.exportable.Exportable

add_punctuation_capitalization(queries: List[str], batch_size: Optional[int] = None, max_seq_length: int = 64, step: int = 8, margin: int = 16, return_labels: bool = False) List[str][source]

Adds punctuation and capitalization to the queries. Use this method for inference.

The parameters max_seq_length, step, and margin control the way queries are split into segments which are then processed by the model. The parameter max_seq_length is the length of a segment after tokenization, including the special tokens [CLS] at the beginning and [SEP] at the end of a segment. The parameter step is the shift between consecutive segments. The parameter margin is used to exclude the negative effect of subtokens near segment borders that have context on only one side.

If segments overlap, the probabilities of overlapping predictions are multiplied, and the label corresponding to the maximum probability is selected.

Parameters
  • queries – lower-cased text without punctuation

  • batch_size – batch size to use during inference

  • max_seq_length – maximum sequence length of segment after tokenization.

  • step – relative shift between consecutive segments into which long queries are split. Long queries are split into segments which can overlap. The parameter step controls that overlap. Imagine that queries are tokenized into characters, max_seq_length=5, and step=2. In that case the query “hello” is tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']].

  • margin – number of subtokens at the beginning and end of segments which are not used for prediction computation. The first segment does not have a left margin, and the last segment does not have a right margin. For example, if the input sequence is tokenized into characters, max_seq_length=5, step=1, and margin=1, then the query “hello” will be tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]. These segments are passed to the model. Before the final predictions are computed, margins are removed. In the following list, subtokens whose logits are not used for the final prediction computation are marked with an asterisk: [['[CLS]'*, 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]'*]].

  • return_labels – whether to return labels in NeMo format (see https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html#nemo-data-format) instead of queries with restored punctuation and capitalization.

Returns

text with added capitalization and punctuation or punctuation and capitalization labels

Return type

result
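
A hedged usage sketch; the pretrained name punctuation_en_bert is illustrative and should be verified via list_available_models():

    from nemo.collections.nlp.models import PunctuationCapitalizationModel

    model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")
    results = model.add_punctuation_capitalization(
        ["how are you", "great how about you"],
        batch_size=2,
        max_seq_length=64,
        step=8,
        margin=16,
    )
    print(results)  # e.g. ['How are you?', 'Great, how about you?']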

apply_punct_capit_predictions(query: str, punct_preds: List[int], capit_preds: List[int]) str[source]

Restores punctuation and capitalization in query.

Parameters
  • query – a string without punctuation and capitalization

  • punct_preds – ids of predicted punctuation labels

  • capit_preds – ids of predicted capitalization labels

Returns

a query with restored punctuation and capitalization

forward(input_ids, attention_mask, token_type_ids=None)[source]

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.

get_labels(punct_preds: List[int], capit_preds: List[int]) str[source]

Returns punctuation and capitalization labels in NeMo format (see https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html#nemo-data-format).

Parameters
  • punct_preds – ids of predicted punctuation labels

  • capit_preds – ids of predicted capitalization labels

Returns

labels in NeMo format

property input_module
property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]

Called at the end of testing to aggregate outputs.

Parameters

outputs – list of individual outputs of each test step.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

property output_module
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[Dict] = None)[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up training data.

setup_validation_data(val_data_config: Optional[Dict] = None)[source]

Sets up validation data.

Parameters

val_data_config – validation data config.

test_step(batch, batch_idx, dataloader_idx=0)[source]

Lightning calls this inside the test loop with the data from the test dataloader passed in as batch.

training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

update_data_dir(data_dir: str) None[source]

Update data directory

Parameters

data_dir – path to data directory

validation_step(batch, batch_idx, dataloader_idx=0)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

class nemo.collections.nlp.models.TokenClassificationModel(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.models.nlp_model.NLPModel

Token Classification Model with BERT, applicable for tasks such as Named Entity Recognition

add_predictions(queries: Union[List[str], str], batch_size: int = 32, output_file: Optional[str] = None) List[str][source]

Adds predicted token labels to the queries. Use this method for debugging and prototyping.

Parameters
  • queries – text

  • batch_size – batch size to use during inference

  • output_file – file in which to save the model’s predictions

Returns

text with added entities

Return type

result
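
A minimal sketch; the pretrained name ner_en_bert is illustrative and should be confirmed via list_available_models():

    from nemo.collections.nlp.models import TokenClassificationModel

    model = TokenClassificationModel.from_pretrained("ner_en_bert")
    results = model.add_predictions(
        ["mary lives in santa clara and works at nvidia"], batch_size=1
    )
    print(results)  # the query annotated with predicted entity labels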

evaluate_from_file(output_dir: str, text_file: str, labels_file: Optional[str] = None, add_confusion_matrix: Optional[bool] = False, normalize_confusion_matrix: Optional[bool] = True, batch_size: int = 1) None[source]

Runs inference on data from a file, plots the confusion matrix, and calculates the classification report. Use this method for final evaluation; see the usage sketch after the parameter list below.

Parameters
  • output_dir – path to the output directory in which to store model predictions and the confusion matrix plot (if add_confusion_matrix is True)

  • text_file – path to file with text. Each line of the text.txt file contains text sequences, where words are separated with spaces: [WORD] [SPACE] [WORD] [SPACE] [WORD]

  • labels_file (Optional) – path to file with labels. Each line of the labels_file should contain labels corresponding to each word in the text_file, separated with spaces: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL] (for labels.txt).

  • add_confusion_matrix – whether to generate confusion matrix

  • normalize_confusion_matrix – whether to normalize confusion matrix

  • batch_size – batch size to use during inference.
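
A hedged sketch of final evaluation, assuming a restored model and files in the format described above (all paths are illustrative):

    from nemo.collections.nlp.models import TokenClassificationModel

    model = TokenClassificationModel.restore_from("ner_model.nemo")
    model.evaluate_from_file(
        output_dir="eval_results",
        text_file="data/text_dev.txt",      # one space-separated word sequence per line
        labels_file="data/labels_dev.txt",  # aligned space-separated labels
        add_confusion_matrix=True,
        normalize_confusion_matrix=True,
        batch_size=8,
    )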

forward(input_ids, token_type_ids, attention_mask)[source]
property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_loss(class_balancing: Optional[str] = None)[source]

Sets up or updates the loss.

Parameters

class_balancing – whether to use class weights during training

setup_test_data(test_data_config: Optional[omegaconf.DictConfig] = None)[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in training.

Parameters

train_data_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in validation.

Parameters

val_data_config – validation data layer parameters.

test_epoch_end(outputs)[source]

Default DataLoader for Test set which automatically supports multiple data loaders via multi_test_epoch_end.

If multi-dataset support is not required, override this method entirely in the base class. In that case, there is no need to implement multi_test_epoch_end either.

Note

If more than one data loader exists, and they all provide test_loss, only the test_loss of the first data loader will be used by default. This default can be changed by passing the special key test_dl_idx: int inside the test_ds config.

Parameters

outputs – Single or nested list of tensor outputs from one or more data loaders.

Returns

A dictionary containing the union of all items from individual data_loaders, along with merged logs from all data loaders.

test_step(batch, batch_idx)[source]
training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

update_data_dir(data_dir: str) None[source]

Updates the data directory and gets data stats with the Data Descriptor. The weights are later used to set up the loss.

Parameters

data_dir – path to data directory

validation_epoch_end(outputs)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

validation_step(batch, batch_idx)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

class nemo.collections.nlp.models.QAModel(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.models.nlp_model.NLPModel

BERT encoder with QA head training.

forward(input_ids, token_type_ids, attention_mask)[source]
inference(file: str, batch_size: int = 1, num_samples: int = -1, output_nbest_file: Optional[str] = None, output_prediction_file: Optional[str] = None)

Gets predictions for unlabeled inference data.

Parameters
  • file – inference data

  • batch_size – batch size to use during inference

  • num_samples – number of samples of the inference data to use. The default of -1 uses all data.

  • output_nbest_file – optional output file for writing out nbest list

  • output_prediction_file – optional output file for writing out predictions

Returns

model predictions, model nbest list
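
A usage sketch (the checkpoint path and file names are illustrative; the input is assumed to be SQuAD-format JSON):

    from nemo.collections.nlp.models import QAModel

    model = QAModel.restore_from("qa_model.nemo")
    all_preds, all_nbest = model.inference(
        file="dev-v1.1.json",
        batch_size=8,
        num_samples=-1,  # -1 uses all samples in the file
        output_prediction_file="predictions.json",
        output_nbest_file="nbest.json",
    )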

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[omegaconf.DictConfig])[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in training.

Parameters

train_data_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_config – validation data layer parameters.

test_epoch_end(outputs)[source]

Default DataLoader for Test set which automatically supports multiple data loaders via multi_test_epoch_end.

If multi-dataset support is not required, override this method entirely in the base class. In that case, there is no need to implement multi_test_epoch_end either.

Note

If more than one data loader exists, and they all provide test_loss, only the test_loss of the first data loader will be used by default. This default can be changed by passing the special key test_dl_idx: int inside the test_ds config.

Parameters

outputs – Single or nested list of tensor outputs from one or more data loaders.

Returns

A dictionary containing the union of all items from individual data_loaders, along with merged logs from all data loaders.

test_step(batch, batch_idx)[source]
training_step(batch, batch_idx)[source]
validation_epoch_end(outputs)[source]

Default DataLoader for Validation set which automatically supports multiple data loaders via multi_validation_epoch_end.

If multi-dataset support is not required, override this method entirely in the base class. In that case, there is no need to implement multi_validation_epoch_end either.

Note

If more than one data loader exists, and they all provide val_loss, only the val_loss of the first data loader will be used by default. This default can be changed by passing the special key val_dl_idx: int inside the validation_ds config.

Parameters

outputs – Single or nested list of tensor outputs from one or more data loaders.

Returns

A dictionary containing the union of all items from individual data_loaders, along with merged logs from all data loaders.

validation_step(batch, batch_idx)[source]
class nemo.collections.nlp.models.BERTLMModel(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.modelPT.ModelPT

BERT language model pretraining.

forward(input_ids, token_type_ids, attention_mask)[source]

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[omegaconf.DictConfig])[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in training.

Parameters

train_data_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_config – validation data layer parameters.

training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

validation_epoch_end(outputs)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs (list) – The individual outputs of each validation step.

Returns

Validation loss and tensorboard logs.

Return type

dict

validation_step(batch, batch_idx)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

Modules

class nemo.collections.nlp.modules.BertModule(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.module.NeuralModule, nemo.core.classes.exportable.Exportable

input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

restore_weights(restore_path: str)[source]

Restores module/model’s weights

class nemo.collections.nlp.modules.AlbertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.AlbertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids)[source]
class nemo.collections.nlp.modules.BertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.BertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask=None, token_type_ids=None)[source]
class nemo.collections.nlp.modules.DistilBertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.DistilBertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids=None)[source]
class nemo.collections.nlp.modules.RobertaEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.RobertaModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, token_type_ids, attention_mask)[source]
class nemo.collections.nlp.modules.SequenceClassifier(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

forward(hidden_states)[source]
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

class nemo.collections.nlp.modules.SequenceRegression(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

Parameters
  • hidden_size – the hidden size of the MLP head on top of the encoder

  • num_layers – number of linear layers of the MLP head on top of the encoder

  • activation – type of activations between layers of the MLP head

  • dropout – the dropout used for the MLP head

  • use_transformer_init – initializes the weights with the same approach used in Transformer

  • idx_conditioned_on – index of the token to use as the sequence representation for the regression task; the default is the first token

forward(hidden_states: torch.Tensor) torch.Tensor[source]

Forward pass through the module.

Parameters

hidden_states – hidden states for each token in a sequence, for example, BERT module output
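
A minimal sketch, assuming the constructor accepts the parameters listed above and that hidden states have shape [batch, seq_len, hidden_size]:

    import torch
    from nemo.collections.nlp.modules import SequenceRegression

    regressor = SequenceRegression(hidden_size=768, num_layers=2, dropout=0.1)
    hidden_states = torch.randn(2, 16, 768)  # e.g. output of a BERT encoder
    preds = regressor(hidden_states=hidden_states)  # one regression value per sequence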

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

class nemo.collections.nlp.modules.SequenceTokenClassifier(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

forward(hidden_states)[source]
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

nemo.collections.nlp.modules.get_lm_model(pretrained_model_name: str, config_dict: Optional[dict] = None, config_file: Optional[str] = None, checkpoint_file: Optional[str] = None, vocab_file: Optional[str] = None) nemo.collections.nlp.modules.common.bert_module.BertModule[source]

Helper function to instantiate a language model encoder, either from scratch or from a pretrained model. If only pretrained_model_name is passed, a pretrained model is returned. If a configuration is passed, whether as a file or a dictionary, the model is initialized with random weights.

Parameters
  • pretrained_model_name – pretrained model name, for example, bert-base-uncased or megatron-bert-cased. See get_pretrained_lm_models_list() for full list.

  • config_dict – model configuration dictionary

  • config_file – path to the model configuration file

  • checkpoint_file – path to the pretrained model checkpoint

  • vocab_file – path to vocab_file to be used with Megatron-LM

Returns

Pretrained BertModule
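
For example (a sketch; the model name is illustrative, see get_pretrained_lm_models_list() for supported names):

    from nemo.collections.nlp.modules import get_lm_model

    # With only a name, pretrained weights are loaded.
    bert = get_lm_model(pretrained_model_name="bert-base-uncased")

    # Passing a config file (or dict) instead initializes random weights:
    # bert = get_lm_model(pretrained_model_name="bert-base-uncased",
    #                     config_file="bert_config.json")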

nemo.collections.nlp.modules.get_pretrained_lm_models_list(include_external: bool = False) List[str][source]

Returns the list of supported pretrained model names

Parameters

include_external – if True, includes all HuggingFace model names, not only the language models supported in NeMo.
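
For example:

    from nemo.collections.nlp.modules import get_pretrained_lm_models_list

    print(get_pretrained_lm_models_list())  # names supported natively by NeMo
    # Including all HuggingFace model names as well:
    print(len(get_pretrained_lm_models_list(include_external=True)))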