NeMo NLP collection API

Model Classes

class nemo.collections.nlp.models.TextClassificationModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

classifytext(queries: List[str], batch_size: int = 1, max_seq_length: int = -1) → List[int]

Get predictions for the queries.

Parameters
  • queries – text sequences

  • batch_size – batch size to use during inference

  • max_seq_length – sequences longer than max_seq_length will get truncated; the default -1 disables truncation.

Returns

model predictions

Return type

all_preds
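For example, a minimal inference sketch (the pretrained model name is illustrative; see list_available_models() for valid keys):

    from nemo.collections.nlp.models import TextClassificationModel

    # Load a checkpoint from NGC; "text_classification_model" is a hypothetical key.
    model = TextClassificationModel.from_pretrained("text_classification_model")
    preds = model.classifytext(queries=["this film was great", "terrible service"], batch_size=2)
    print(preds)  # one predicted class index per query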

create_loss_module()[source]
forward(input_ids, token_type_ids, attention_mask)[source]

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.

classmethod from_pretrained(name: str)[source]

Instantiates an instance of NeMo from the NVIDIA NGC cloud. Use restore_from() to instantiate from a local .nemo file.

Parameters
  • model_name – string key which will be used to find the module.

  • refresh_cache – if set to True, then when fetching from cloud, this will re-fetch the file from cloud even if it is already found in a cache locally.

  • override_config_path – path to a yaml config that will override the internal config file

  • map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.

  • strict – Passed to torch.load_state_dict

  • return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.

Returns

A model instance of a particular model class or its underlying config (if return_config is set).
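For example (the model name and path are illustrative):

    from nemo.collections.nlp.models import TextClassificationModel

    # From the NGC cloud, by pretrained model name:
    model = TextClassificationModel.from_pretrained("some_pretrained_name")

    # From a local checkpoint, use restore_from() instead:
    model = TextClassificationModel.restore_from("path/to/model.nemo")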

property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[Dict[str, str]][source]

Should list all pre-trained models available via NVIDIA NGC cloud

Returns

A list of PretrainedModelInfo entries
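For example, to inspect the available checkpoints (attribute access on PretrainedModelInfo is an assumption):

    from nemo.collections.nlp.models import TextClassificationModel

    for info in TextClassificationModel.list_available_models() or []:
        # Each entry describes one NGC model; pretrained_model_name is assumed here.
        print(info.pretrained_model_name)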

property output_types

Define these to enable output neural type checks
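As a sketch, such properties typically map argument names to NeuralType objects; the exact types below are an assumption and vary per model:

    from nemo.core.neural_types import ChannelType, LogitsType, NeuralType

    class MyTextModel:  # stand-in for a NeMo model subclass
        @property
        def input_types(self):
            # token ids, segment ids and attention mask, each shaped [batch, time]
            return {
                "input_ids": NeuralType(("B", "T"), ChannelType()),
                "token_type_ids": NeuralType(("B", "T"), ChannelType()),
                "attention_mask": NeuralType(("B", "T"), ChannelType()),
            }

        @property
        def output_types(self):
            # classification logits shaped [batch, num_classes]
            return {"logits": NeuralType(("B", "D"), LogitsType())}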

setup_test_data(test_data_config: Optional[omegaconf.DictConfig])[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_layer_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.
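For example, a config can be built with OmegaConf and passed in directly; the keys shown are hypothetical and depend on the model's dataset class:

    from omegaconf import OmegaConf

    train_cfg = OmegaConf.create({
        "file_path": "train.tsv",  # hypothetical key
        "batch_size": 32,
        "shuffle": True,
    })
    model.setup_training_data(train_data_config=train_cfg)  # model instantiated earlier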

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

test_epoch_end(outputs)[source]

Called at the end of test to aggregate outputs.

Parameters

outputs – list of individual outputs of each test step.

test_step(batch, batch_idx)[source]

Lightning calls this inside the test loop with the data from the test dataloader passed in as batch.

training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

validation_epoch_end(outputs)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

validation_step(batch, batch_idx)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

class nemo.collections.nlp.models.GLUEModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

forward(input_ids, token_type_ids, attention_mask)[source]
property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[Dict[str, str]][source]

Should list all pre-trained models available via NVIDIA NGC cloud

Returns

A list of PretrainedModelInfo entries

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

property output_types

Define these to enable output neural type checks

setup_multiple_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]] = None)[source]

(Optionally) sets up the data loader to be used in validation, with support for multiple data loaders.

Parameters

val_data_layer_config – validation data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

training_step(batch, batch_idx)[source]
update_data_dir(data_dir: str) → None[source]

Updates the data directory and gets data stats with the Data Descriptor. The weights are later used to set up the loss.

Parameters

data_dir – path to data directory

validation_step(batch, batch_idx, dataloader_idx=0)[source]
class nemo.collections.nlp.models.PunctuationCapitalizationModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

add_punctuation_capitalization(queries: List[str], batch_size: Optional[int] = None, max_seq_length: int = 512) → List[str][source]

Adds punctuation and capitalization to the queries. Use this method for debugging and prototyping.

Parameters
  • queries – lower-cased text without punctuation

  • batch_size – batch size to use during inference

  • max_seq_length – maximum sequence length after tokenization

Returns

text with added capitalization and punctuation

Return type

result
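For example (the pretrained model name is an assumption; see list_available_models()):

    from nemo.collections.nlp.models import PunctuationCapitalizationModel

    model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")
    results = model.add_punctuation_capitalization(queries=["how are you", "great thanks"])
    print(results)  # e.g. ["How are you?", "Great, thanks."]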

forward(input_ids, attention_mask, token_type_ids=None)[source]

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.

property input_module
property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[Dict[str, str]][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]

Called at the end of test to aggregate outputs.

Parameters

outputs – list of individual outputs of each test step.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

property output_module
property output_types

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[Dict] = None)[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_layer_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the training data.

setup_validation_data(val_data_config: Optional[Dict] = None)[source]

Sets up the validation data.

Parameters

val_data_config – validation data config.

test_step(batch, batch_idx, dataloader_idx=0)[source]

Lightning calls this inside the test loop with the data from the test dataloader passed in as batch.

training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

update_data_dir(data_dir: str) → None[source]

Update data directory

Parameters

data_dir – path to data directory

validation_step(batch, batch_idx, dataloader_idx=0)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

class nemo.collections.nlp.models.TokenClassificationModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Token Classification Model with BERT, applicable for tasks such as Named Entity Recognition

add_predictions(queries: Union[List[str], str], batch_size: int = 32, output_file: Optional[str] = None) → List[str][source]

Adds predicted token labels to the queries. Use this method for debugging and prototyping.

Parameters
  • queries – text

  • batch_size – batch size to use during inference.

  • output_file – file to save the model’s predictions

Returns

text with added entities

Return type

result
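For example (the pretrained model name is an assumption):

    from nemo.collections.nlp.models import TokenClassificationModel

    model = TokenClassificationModel.from_pretrained("ner_en_bert")
    annotated = model.add_predictions(queries=["we flew to paris in july"], batch_size=32)
    print(annotated)  # input text with predicted entity labels attached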

evaluate_from_file(output_dir: str, text_file: str, labels_file: Optional[str] = None, add_confusion_matrix: Optional[bool] = False, normalize_confusion_matrix: Optional[bool] = True, batch_size: int = 1) → None[source]

Run inference on data from a file, plot confusion matrix and calculate classification report. Use this method for final evaluation.

Parameters
  • output_dir – path to the output directory used to store model predictions and the confusion matrix plot (if add_confusion_matrix is set to True)

  • text_file – path to file with text. Each line of the text.txt file contains text sequences, where words are separated with spaces: [WORD] [SPACE] [WORD] [SPACE] [WORD]

  • labels_file (Optional) – path to file with labels. Each line of the labels_file should contain labels corresponding to each word in the text_file; the labels are separated with spaces: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL] (for labels.txt).

  • add_confusion_matrix – whether to generate confusion matrix

  • normalize_confusion_matrix – whether to normalize confusion matrix

  • batch_size – batch size to use during inference.
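A usage sketch, assuming model is an instantiated TokenClassificationModel and the file paths exist:

    model.evaluate_from_file(
        output_dir="eval_out",         # predictions and plots are written here
        text_file="text_dev.txt",      # one space-separated sequence per line
        labels_file="labels_dev.txt",  # matching labels, one line per sequence
        add_confusion_matrix=True,
        batch_size=8,
    )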

forward(input_ids, token_type_ids, attention_mask)[source]
property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

setup_loss(class_balancing: Optional[str] = None)[source]

Sets up or updates the loss.

Parameters

class_balancing – whether to use class weights during training

setup_test_data(test_data_config: Optional[omegaconf.DictConfig] = None)[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_layer_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig] = None)[source]

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

test_epoch_end(outputs)[source]

Default epoch-end aggregation for the test set, which automatically supports multiple data loaders via multi_test_epoch_end.

If multi-dataset support is not required, override this method entirely in the subclass; in that case, there is no need to implement multi_test_epoch_end either.

Note

If more than one data loader exists, and they all provide test_loss, only the test_loss of the first data loader will be used by default. This default can be changed by passing the special key test_dl_idx: int inside the test_ds config.

Parameters

outputs – Single or nested list of tensor outputs from one or more data loaders.

Returns

A dictionary containing the union of all items from individual data_loaders, along with merged logs from all data loaders.

test_step(batch, batch_idx)[source]
training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

update_data_dir(data_dir: str) → None[source]

Updates the data directory and gets data stats with the Data Descriptor. The weights are later used to set up the loss.

Parameters

data_dir – path to data directory

validation_epoch_end(outputs)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs – list of individual outputs of each validation step.

validation_step(batch, batch_idx)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

class nemo.collections.nlp.models.QAModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

BERT encoder with QA head training.

forward(input_ids, token_type_ids, attention_mask)[source]
inference(file: str, batch_size: int = 1, num_samples: int = -1, output_nbest_file: Optional[str] = None, output_prediction_file: Optional[str] = None)

Get prediction for unlabeled inference data

Parameters
  • file – inference data

  • batch_size – batch size to use during inference

  • num_samples – number of samples of the inference data to use. Default: -1, i.e. all data is used.

  • output_nbest_file – optional output file for writing out nbest list

  • output_prediction_file – optional output file for writing out predictions

Returns

model predictions, model nbest list
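A usage sketch (the pretrained model name and file paths are assumptions):

    from nemo.collections.nlp.models import QAModel

    model = QAModel.from_pretrained("qa_squadv1_1_bertbase")
    predictions, nbest = model.inference(
        file="dev-v1.1.json",  # SQuAD-format inference data
        batch_size=8,
        output_prediction_file="predictions.json",
    )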

property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[omegaconf.DictConfig])[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_layer_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

test_epoch_end(outputs)[source]

Default epoch-end aggregation for the test set, which automatically supports multiple data loaders via multi_test_epoch_end.

If multi-dataset support is not required, override this method entirely in the subclass; in that case, there is no need to implement multi_test_epoch_end either.

Note

If more than one data loader exists, and they all provide test_loss, only the test_loss of the first data loader will be used by default. This default can be changed by passing the special key test_dl_idx: int inside the test_ds config.

Parameters

outputs – Single or nested list of tensor outputs from one or more data loaders.

Returns

A dictionary containing the union of all items from individual data_loaders, along with merged logs from all data loaders.

test_step(batch, batch_idx)[source]
training_step(batch, batch_idx)[source]
validation_epoch_end(outputs)[source]

Default epoch-end aggregation for the validation set, which automatically supports multiple data loaders via multi_validation_epoch_end.

If multi-dataset support is not required, override this method entirely in the subclass; in that case, there is no need to implement multi_validation_epoch_end either.

Note

If more than one data loader exists, and they all provide val_loss, only the val_loss of the first data loader will be used by default. This default can be changed by passing the special key val_dl_idx: int inside the validation_ds config.

Parameters

outputs – Single or nested list of tensor outputs from one or more data loaders.

Returns

A dictionary containing the union of all items from individual data_loaders, along with merged logs from all data loaders.

validation_step(batch, batch_idx)[source]
class nemo.collections.nlp.models.BERTLMModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

BERT language model pretraining.

forward(input_ids, token_type_ids, attention_mask)[source]

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.

property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[omegaconf.DictConfig])[source]

(Optionally) sets up the data loader to be used in testing.

Parameters

test_data_layer_config – test data layer parameters.

setup_training_data(train_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.

setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

training_step(batch, batch_idx)[source]

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.

validation_epoch_end(outputs)[source]

Called at the end of validation to aggregate outputs.

Parameters

outputs (list) – The individual outputs of each validation step.

Returns

Validation loss and tensorboard logs.

Return type

dict

validation_step(batch, batch_idx)[source]

Lightning calls this inside the validation loop with the data from the validation dataloader passed in as batch.

Modules

class nemo.collections.nlp.modules.BertModule(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

property output_types

Define these to enable output neural type checks

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

restore_weights(restore_path: str)[source]

Restores module/model’s weights

classmethod save_to(save_path: str)[source]

Saves module/model with weights
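A sketch of the save/restore round trip, assuming the module was created via get_lm_model (the file path is illustrative):

    from nemo.collections.nlp.modules import get_lm_model

    module = get_lm_model(pretrained_model_name="bert-base-uncased")
    module.save_to("bert_module.pt")          # save the module together with its weights
    module.restore_weights("bert_module.pt")  # load the weights back in place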

class nemo.collections.nlp.modules.MegatronBertEncoder(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

MegatronBERT wraps around the Megatron Language model from https://github.com/NVIDIA/Megatron-LM

Parameters
  • config_file (str) – path to model configuration file.

  • vocab_file (str) – path to vocabulary file.

  • tokenizer_type (str) – tokenizer type, currently only ‘BertWordPieceLowerCase’ supported.

complete_lazy_init()[source]
forward(input_ids, attention_mask, token_type_ids)[source]
property hidden_size

Property returning hidden size.

Returns

Hidden size.

restore_weights(restore_path: str)[source]

Restores the module/model’s weights.

For model-parallel checkpoints, the directory structure should be restore_path/mp_rank_0X/model_optim_rng.pt.

Parameters

restore_path (str) – restore_path should be a file, or a directory if using model parallelism
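For example, with a model-parallel checkpoint directory (the paths and constructor arguments follow the Parameters above and are illustrative):

    from nemo.collections.nlp.modules import MegatronBertEncoder

    encoder = MegatronBertEncoder(
        config_file="megatron_config.json",
        vocab_file="vocab.txt",
        tokenizer_type="BertWordPieceLowerCase",
    )
    # Directory layout: checkpoints/mp_rank_00/model_optim_rng.pt, mp_rank_01/..., etc.
    encoder.restore_weights("checkpoints")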

class nemo.collections.nlp.modules.AlbertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.AlbertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids)[source]
class nemo.collections.nlp.modules.BertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.BertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids)[source]
class nemo.collections.nlp.modules.DistilBertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.DistilBertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids=None)[source]
class nemo.collections.nlp.modules.RobertaEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.RobertaModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.

forward(input_ids, token_type_ids, attention_mask)[source]
class nemo.collections.nlp.modules.SequenceClassifier(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

forward(hidden_states)[source]
property output_types

Define these to enable output neural type checks

class nemo.collections.nlp.modules.SequenceRegression(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Parameters
  • hidden_size – the hidden size of the mlp head on the top of the encoder

  • num_layers – number of the linear layers of the mlp head on the top of the encoder

  • activation – type of activations between layers of the mlp head

  • dropout – the dropout used for the mlp head

  • use_transformer_init – initializes the weights with the same approach used in Transformer

  • idx_conditioned_on – index of the token to use as the sequence representation for the regression task, default is the first token

forward(hidden_states: torch.Tensor) → torch.Tensor[source]

Forward pass through the module.

Parameters

hidden_states – hidden states for each token in a sequence, for example, BERT module output

property output_types

Define these to enable output neural type checks
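A minimal usage sketch with dummy hidden states (the sizes are illustrative; hidden_size must match the encoder output):

    import torch

    from nemo.collections.nlp.modules import SequenceRegression

    regressor = SequenceRegression(hidden_size=768, num_layers=2, dropout=0.1)
    hidden_states = torch.randn(4, 16, 768)  # [batch, seq_len, hidden], e.g. BERT output
    scores = regressor(hidden_states)        # one regression value per sequence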

class nemo.collections.nlp.modules.SequenceTokenClassifier(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

forward(hidden_states)[source]
property output_types

Define these to enable output neural type checks

nemo.collections.nlp.modules.get_lm_model(pretrained_model_name: str, config_dict: Optional[dict] = None, config_file: Optional[str] = None, checkpoint_file: Optional[str] = None, vocab_file: Optional[str] = None) → nemo.collections.nlp.modules.common.bert_module.BertModule[source]

Helper function to instantiate a language model encoder, either from scratch or from a pretrained model. If only pretrained_model_name is passed, a pretrained model is returned. If a configuration is passed, whether as a file or a dictionary, the model is initialized with random weights.

Parameters
  • pretrained_model_name – pretrained model name, for example, bert-base-uncased or megatron-bert-cased. See get_pretrained_lm_models_list() for full list.

  • config_dict – the model configuration as a dictionary

  • config_file – path to the model configuration file

  • checkpoint_file – path to the pretrained model checkpoint

  • vocab_file – path to vocab_file to be used with Megatron-LM

Returns

Pretrained BertModule
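For example (the paths are illustrative):

    from nemo.collections.nlp.modules import get_lm_model

    # Pretrained weights, found by name:
    encoder = get_lm_model(pretrained_model_name="bert-base-uncased")

    # Passing a config instead yields randomly initialized weights:
    scratch = get_lm_model(pretrained_model_name="bert-base-uncased",
                           config_file="bert_config.json")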

nemo.collections.nlp.modules.get_pretrained_lm_models_list(include_external: bool = False) → List[str][source]

Returns the list of supported pretrained model names

Parameters
include_external – if True, includes all HuggingFace model names, not only the language models supported in NeMo.

nemo.collections.nlp.modules.get_megatron_lm_models_list() → List[str][source]

Returns the list of supported Megatron-LM models
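For example:

    from nemo.collections.nlp.modules import (
        get_megatron_lm_models_list,
        get_pretrained_lm_models_list,
    )

    print(get_pretrained_lm_models_list())                       # NeMo-supported names
    print(get_pretrained_lm_models_list(include_external=True))  # plus HuggingFace names
    print(get_megatron_lm_models_list())                         # Megatron-LM names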