NeMo NLP collection API

Model Classes
class nemo.collections.nlp.models.TextClassificationModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model
classifytext(queries: List[str], batch_size: int = 1, max_seq_length: int = -1) → List[int]

Get predictions for the queries.

Parameters:
- queries – text sequences
- batch_size – batch size to use during inference
- max_seq_length – sequences longer than max_seq_length will be truncated; the default of -1 disables truncation

Returns:
- all_preds – model predictions
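A minimal usage sketch (the .nemo path is a placeholder for a locally trained checkpoint):

    from nemo.collections.nlp.models import TextClassificationModel

    # Placeholder path to a .nemo archive produced by a prior training run.
    model = TextClassificationModel.restore_from("text_classification.nemo")
    model.eval()

    queries = ["a masterpiece of modern cinema", "the plot never gets going"]
    preds = model.classifytext(queries=queries, batch_size=2, max_seq_length=128)
    print(preds)  # one predicted class index per query, e.g. [1, 0]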
forward(input_ids, token_type_ids, attention_mask)

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.
classmethod from_pretrained(name: str)

Instantiates an instance of a NeMo model from the NVIDIA NGC cloud. Use restore_from() to instantiate from a local .nemo file.

Parameters:
- name – string key which will be used to find the module
- refresh_cache – if set to True, fetching from the cloud re-downloads the file even if it is already found in a local cache
- override_config_path – path to a yaml config that will override the internal config file
- map_location – optional torch.device() to map the instantiated model to a device. By default (None), it selects a GPU if available, falling back to CPU otherwise.
- strict – passed to torch.load_state_dict
- return_config – if set to True, returns just the underlying config of the restored model as an OmegaConf DictConfig object, without instantiating the model

Returns:
- A model instance of a particular model class, or its underlying config (if return_config is set).
property input_types

Define these to enable input neural type checks.
classmethod list_available_models() → Optional[Dict[str, str]]

Should list all pre-trained models available via the NVIDIA NGC cloud.

Returns:
- A list of PretrainedModelInfo entries
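Together with from_pretrained() above, this supports a discover-then-load pattern; a sketch (the NGC key is a placeholder, and the list may be empty for some classes):

    from nemo.collections.nlp.models import TextClassificationModel

    # Enumerate pre-trained checkpoints registered on NGC, if any.
    for info in TextClassificationModel.list_available_models() or []:
        print(info)

    # Instantiate by NGC key (placeholder), or use restore_from() for a local file.
    # model = TextClassificationModel.from_pretrained(name="<ngc_model_key>")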
property output_types

Define these to enable output neural type checks.
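For reference, these type dictionaries map port names to NeuralType instances. An illustrative pair for a BERT-style classifier (not this class's exact definition) might look like:

    from nemo.core.neural_types import ChannelType, LogitsType, MaskType, NeuralType

    # 'B' = batch axis, 'T' = sequence axis, 'D' = logits axis.
    input_types = {
        "input_ids": NeuralType(("B", "T"), ChannelType()),
        "token_type_ids": NeuralType(("B", "T"), ChannelType()),
        "attention_mask": NeuralType(("B", "T"), MaskType()),
    }
    output_types = {"logits": NeuralType(("B", "D"), LogitsType())}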
setup_test_data(test_data_config: Optional[omegaconf.DictConfig])

(Optionally) sets up the data loader to be used in testing.

Parameters:
- test_data_layer_config – test data layer parameters
setup_training_data(train_data_config: Optional[omegaconf.DictConfig])

Sets up the data loader to be used in training.

Parameters:
- train_data_layer_config – training data layer parameters
setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])

Sets up the data loader to be used in validation (see the sketch below).

Parameters:
- val_data_layer_config – validation data layer parameters
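The setup methods above can be driven from Python rather than YAML. A sketch, where model is an instantiated TextClassificationModel; the key names (file_path, batch_size, shuffle) follow the text-classification config convention but should be checked against the model's config file:

    from omegaconf import OmegaConf

    train_config = OmegaConf.create({
        "file_path": "train.tsv",  # placeholder path
        "batch_size": 64,
        "shuffle": True,
    })
    model.setup_training_data(train_data_config=train_config)

    val_config = OmegaConf.create({
        "file_path": "dev.tsv",  # placeholder path
        "batch_size": 64,
        "shuffle": False,
    })
    model.setup_validation_data(val_data_config=val_config)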
test_epoch_end(outputs)

Called at the end of testing to aggregate outputs.

Parameters:
- outputs – list of individual outputs of each test step
test_step(batch, batch_idx)

Lightning calls this inside the test loop with the data from the test dataloader passed in as batch.
training_step(batch, batch_idx)

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.
class nemo.collections.nlp.models.GLUEModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model
property input_types

Define these to enable input neural type checks.
classmethod list_available_models() → Optional[Dict[str, str]]

Should list all pre-trained models available via the NVIDIA NGC cloud.

Returns:
- A list of PretrainedModelInfo entries
multi_validation_epoch_end(outputs, dataloader_idx: int = 0)

Called at the end of validation to aggregate outputs.

Parameters:
- outputs – list of individual outputs of each validation step
property output_types

Define these to enable output neural type checks.
setup_multiple_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]] = None)

(Optionally) sets up data loaders to be used in validation, with support for multiple data loaders (see the sketch below).

Parameters:
- val_data_layer_config – validation data layer parameters
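A sketch under the assumption that the multi-dataloader convention used elsewhere in NeMo configs (a list-valued ds_item, one loader per entry) applies here, e.g. for MNLI matched/mismatched dev sets; key names are illustrative:

    from omegaconf import OmegaConf

    val_config = OmegaConf.create({
        "ds_item": ["dev_matched.tsv", "dev_mismatched.tsv"],  # one loader each
        "batch_size": 32,
        "shuffle": False,
    })
    model.setup_multiple_validation_data(val_data_config=val_config)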
setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)

Sets up the data loader to be used in training.

Parameters:
- train_data_layer_config – training data layer parameters
setup_validation_data(val_data_config: Optional[omegaconf.DictConfig] = None)

Sets up the data loader to be used in validation.

Parameters:
- val_data_layer_config – validation data layer parameters
class nemo.collections.nlp.models.PunctuationCapitalizationModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model
add_punctuation_capitalization(queries: List[str], batch_size: Optional[int] = None, max_seq_length: int = 512) → List[str]

Adds punctuation and capitalization to the queries. Use this method for debugging and prototyping.

Parameters:
- queries – lower-cased text without punctuation
- batch_size – batch size to use during inference
- max_seq_length – maximum sequence length after tokenization

Returns:
- result – text with added capitalization and punctuation
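A short usage sketch; "punctuation_en_bert" is a published NGC key for this class, but verify it against list_available_models():

    from nemo.collections.nlp.models import PunctuationCapitalizationModel

    model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")
    queries = ["how are you", "great how about you"]
    print(model.add_punctuation_capitalization(queries, batch_size=2))
    # e.g. ['How are you?', 'Great, how about you?']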
forward(input_ids, attention_mask, token_type_ids=None)

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.
property input_module
property input_types

Define these to enable input neural type checks.
classmethod list_available_models() → Optional[Dict[str, str]]

Returns a list of pre-trained models which can be instantiated directly from NVIDIA's NGC cloud.

Returns:
- List of available pre-trained models.
multi_test_epoch_end(outputs, dataloader_idx: int = 0)

Called at the end of testing to aggregate outputs.

Parameters:
- outputs – list of individual outputs of each test step
multi_validation_epoch_end(outputs, dataloader_idx: int = 0)

Called at the end of validation to aggregate outputs.

Parameters:
- outputs – list of individual outputs of each validation step
property output_module
property output_types

Define these to enable output neural type checks.
setup_test_data(test_data_config: Optional[Dict] = None)

(Optionally) sets up the data loader to be used in testing.

Parameters:
- test_data_layer_config – test data layer parameters
setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)

Sets up the training data.
setup_validation_data(val_data_config: Optional[Dict] = None)

Sets up the validation data.

Parameters:
- val_data_config – validation data config
test_step(batch, batch_idx, dataloader_idx=0)

Lightning calls this inside the test loop with the data from the test dataloader passed in as batch.
training_step(batch, batch_idx)

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.
class nemo.collections.nlp.models.TokenClassificationModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Token classification model with BERT, applicable to tasks such as Named Entity Recognition.
add_predictions(queries: Union[List[str], str], batch_size: int = 32, output_file: Optional[str] = None) → List[str]

Add predicted token labels to the queries. Use this method for debugging and prototyping.

Parameters:
- queries – text
- batch_size – batch size to use during inference
- output_file – file in which to save the model's predictions

Returns:
- result – text with added entities
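A short usage sketch; "ner_en_bert" is a published NGC key for this class, but verify it against list_available_models():

    from nemo.collections.nlp.models import TokenClassificationModel

    model = TokenClassificationModel.from_pretrained("ner_en_bert")
    results = model.add_predictions(
        queries=["mary lives in santa clara and works at nvidia"],
        batch_size=1,
    )
    print(results[0])  # predicted entity labels are inlined into the text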
evaluate_from_file(output_dir: str, text_file: str, labels_file: Optional[str] = None, add_confusion_matrix: Optional[bool] = False, normalize_confusion_matrix: Optional[bool] = True, batch_size: int = 1) → None

Run inference on data from a file, plot the confusion matrix, and calculate the classification report. Use this method for final evaluation.

Parameters:
- output_dir – path to the output directory where model predictions and the confusion matrix plot (if enabled) are stored
- text_file – path to a file with text. Each line contains one text sequence, with words separated by spaces: [WORD] [SPACE] [WORD] [SPACE] [WORD]
- labels_file (optional) – path to a file with labels. Each line contains the labels corresponding to each word in text_file, separated by spaces: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL]
- add_confusion_matrix – whether to generate a confusion matrix
- normalize_confusion_matrix – whether to normalize the confusion matrix
- batch_size – batch size to use during inference
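For example, given aligned text and label files in the format above (paths are placeholders, and model is a trained TokenClassificationModel):

    # text.txt   : we bought four shirts from the store
    # labels.txt : O O O O O O O
    model.evaluate_from_file(
        output_dir="eval_results",
        text_file="text.txt",
        labels_file="labels.txt",
        add_confusion_matrix=True,
        batch_size=8,
    )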
property input_types

Define these to enable input neural type checks.
classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo]

Returns a list of pre-trained models which can be instantiated directly from NVIDIA's NGC cloud.

Returns:
- List of available pre-trained models.
property output_types

Define these to enable output neural type checks.
setup_loss(class_balancing: Optional[str] = None)

Sets up or updates the loss.

Parameters:
- class_balancing – whether to use class weights during training
setup_test_data(test_data_config: Optional[omegaconf.DictConfig] = None)

(Optionally) sets up the data loader to be used in testing.

Parameters:
- test_data_layer_config – test data layer parameters
setup_training_data(train_data_config: Optional[omegaconf.DictConfig] = None)

Sets up the data loader to be used in training.

Parameters:
- train_data_layer_config – training data layer parameters
setup_validation_data(val_data_config: Optional[omegaconf.DictConfig] = None)

Sets up the data loader to be used in validation.

Parameters:
- val_data_layer_config – validation data layer parameters
test_epoch_end(outputs)

Default test-epoch-end hook, which automatically supports multiple data loaders via multi_test_epoch_end.

If multi-dataset support is not required, override this method entirely in the base class. In that case, there is no need to implement multi_test_epoch_end either.

Note: If more than one data loader exists, and they all provide test_loss, only the test_loss of the first data loader is used by default. This default can be changed by passing the special key test_dl_idx: int inside the test_ds config.

Parameters:
- outputs – single or nested list of tensor outputs from one or more data loaders

Returns:
- A dictionary containing the union of all items from individual data loaders, along with merged logs from all data loaders.
training_step(batch, batch_idx)

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.
update_data_dir(data_dir: str) → None

Update the data directory and get data stats with the data descriptor. The resulting weights are later used to set up the loss.

Parameters:
- data_dir – path to the data directory
class nemo.collections.nlp.models.QAModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

BERT encoder with a QA head, for question-answering training.
inference(file: str, batch_size: int = 1, num_samples: int = -1, output_nbest_file: Optional[str] = None, output_prediction_file: Optional[str] = None)

Get predictions for unlabeled inference data.

Parameters:
- file – inference data
- batch_size – batch size to use during inference
- num_samples – number of samples of the inference data to use; the default of -1 uses all data
- output_nbest_file – optional output file for writing out the n-best list
- output_prediction_file – optional output file for writing out predictions

Returns:
- model predictions, model n-best list
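A sketch assuming a locally trained checkpoint and SQuAD-format input (paths are placeholders):

    from nemo.collections.nlp.models import QAModel

    model = QAModel.restore_from("qa_squad.nemo")
    predictions, nbest = model.inference(
        file="dev-v2.0.json",
        batch_size=8,
        num_samples=-1,  # -1: use the entire file
        output_prediction_file="predictions.json",
    )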
property input_types

Define these to enable input neural type checks.
classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo]

Returns a list of pre-trained models which can be instantiated directly from NVIDIA's NGC cloud.

Returns:
- List of available pre-trained models.
property output_types

Define these to enable output neural type checks.
setup_test_data(test_data_config: Optional[omegaconf.DictConfig])

(Optionally) sets up the data loader to be used in testing.

Parameters:
- test_data_layer_config – test data layer parameters
setup_training_data(train_data_config: Optional[omegaconf.DictConfig])

Sets up the data loader to be used in training.

Parameters:
- train_data_layer_config – training data layer parameters
setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])

Sets up the data loader to be used in validation.

Parameters:
- val_data_layer_config – validation data layer parameters
test_epoch_end(outputs)

Default test-epoch-end hook, which automatically supports multiple data loaders via multi_test_epoch_end.

If multi-dataset support is not required, override this method entirely in the base class. In that case, there is no need to implement multi_test_epoch_end either.

Note: If more than one data loader exists, and they all provide test_loss, only the test_loss of the first data loader is used by default. This default can be changed by passing the special key test_dl_idx: int inside the test_ds config.

Parameters:
- outputs – single or nested list of tensor outputs from one or more data loaders

Returns:
- A dictionary containing the union of all items from individual data loaders, along with merged logs from all data loaders.
validation_epoch_end(outputs)

Default validation-epoch-end hook, which automatically supports multiple data loaders via multi_validation_epoch_end.

If multi-dataset support is not required, override this method entirely in the base class. In that case, there is no need to implement multi_validation_epoch_end either.

Note: If more than one data loader exists, and they all provide val_loss, only the val_loss of the first data loader is used by default. This default can be changed by passing the special key val_dl_idx: int inside the validation_ds config.

Parameters:
- outputs – single or nested list of tensor outputs from one or more data loaders

Returns:
- A dictionary containing the union of all items from individual data loaders, along with merged logs from all data loaders.
class nemo.collections.nlp.models.BERTLMModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

BERT language model pretraining.
forward(input_ids, token_type_ids, attention_mask)

No special modification required for Lightning; define it as you normally would for an nn.Module in vanilla PyTorch.
property input_types

Define these to enable input neural type checks.
classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo]

Returns a list of pre-trained models which can be instantiated directly from NVIDIA's NGC cloud.

Returns:
- List of available pre-trained models.
property output_types

Define these to enable output neural type checks.
setup_test_data(test_data_config: Optional[omegaconf.DictConfig])

(Optionally) sets up the data loader to be used in testing.

Parameters:
- test_data_layer_config – test data layer parameters
setup_training_data(train_data_config: Optional[omegaconf.DictConfig])

Sets up the data loader to be used in training.

Parameters:
- train_data_layer_config – training data layer parameters
setup_validation_data(val_data_config: Optional[omegaconf.DictConfig])

Sets up the data loader to be used in validation.

Parameters:
- val_data_layer_config – validation data layer parameters
training_step(batch, batch_idx)

Lightning calls this inside the training loop with the data from the training dataloader passed in as batch.
Modules
class nemo.collections.nlp.modules.BertModule(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO
input_example()

Generates input examples for tracing etc.

Returns:
- A tuple of input examples.
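One way to use this, sketched for an already-instantiated BertModule subclass bound to the (assumed) variable encoder:

    import torch

    # input_example() returns a tuple shaped like real model inputs,
    # which is convenient for tracing-based export.
    example_inputs = encoder.input_example()
    traced = torch.jit.trace(encoder, example_inputs, strict=False)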
property input_types

Define these to enable input neural type checks.
property output_types

Define these to enable output neural type checks.
class nemo.collections.nlp.modules.MegatronBertEncoder(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

MegatronBertEncoder wraps the Megatron language model from https://github.com/NVIDIA/Megatron-LM.

Parameters:
- config_file (str) – path to the model configuration file
- vocab_file (str) – path to the vocabulary file
- tokenizer_type (str) – tokenizer type; currently only 'BertWordPieceLowerCase' is supported

property hidden_size

Property returning the hidden size.

Returns:
- Hidden size.
class nemo.collections.nlp.modules.AlbertEncoder(*args: Any, **kwargs: Any)

Bases: transformers.AlbertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.
class nemo.collections.nlp.modules.BertEncoder(*args: Any, **kwargs: Any)

Bases: transformers.BertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.
class nemo.collections.nlp.modules.DistilBertEncoder(*args: Any, **kwargs: Any)

Bases: transformers.DistilBertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.
class nemo.collections.nlp.modules.RobertaEncoder(*args: Any, **kwargs: Any)

Bases: transformers.RobertaModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the Hugging Face transformers implementation for easy use within NeMo.
class nemo.collections.nlp.modules.SequenceClassifier(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

property output_types

Define these to enable output neural type checks.
class nemo.collections.nlp.modules.SequenceRegression(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Parameters:
- hidden_size – the hidden size of the MLP head on top of the encoder
- num_layers – number of linear layers in the MLP head
- activation – type of activations between layers of the MLP head
- dropout – the dropout used for the MLP head
- use_transformer_init – initializes the weights with the same approach used in Transformer
- idx_conditioned_on – index of the token to use as the sequence representation; the default is the first token
forward(hidden_states: torch.Tensor) → torch.Tensor

Forward pass through the module.

Parameters:
- hidden_states – hidden states for each token in a sequence, for example, BERT module output
property output_types

Define these to enable output neural type checks.
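A self-contained sketch of the head in isolation; hidden_size=768 matches BERT-base, and all values are illustrative:

    import torch
    from nemo.collections.nlp.modules import SequenceRegression

    head = SequenceRegression(hidden_size=768, num_layers=2, dropout=0.1)
    hidden_states = torch.randn(4, 16, 768)  # [batch, seq_len, hidden]
    scores = head(hidden_states=hidden_states)  # one value per sequence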
class nemo.collections.nlp.modules.SequenceTokenClassifier(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

property output_types

Define these to enable output neural type checks.
nemo.collections.nlp.modules.get_lm_model(pretrained_model_name: str, config_dict: Optional[dict] = None, config_file: Optional[str] = None, checkpoint_file: Optional[str] = None, vocab_file: Optional[str] = None) → nemo.collections.nlp.modules.common.bert_module.BertModule

Helper function to instantiate a language model encoder, either from scratch or from a pretrained model. If only pretrained_model_name is passed, a pretrained model is returned. If a configuration is passed, whether as a file or a dictionary, the model is initialized with random weights.

Parameters:
- pretrained_model_name – pretrained model name, for example, bert-base-uncased or megatron-bert-cased. See get_pretrained_lm_models_list() for the full list.
- config_dict – model configuration dictionary
- config_file – path to the model configuration file
- checkpoint_file – path to the pretrained model checkpoint
- vocab_file – path to the vocab file to be used with Megatron-LM

Returns:
- Pretrained BertModule
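A usage sketch covering both paths described above (file paths are placeholders):

    from nemo.collections.nlp.modules import get_lm_model

    # Pretrained weights: only the name is given.
    encoder = get_lm_model(pretrained_model_name="bert-base-uncased")

    # Random initialization: a configuration is supplied as well.
    # encoder = get_lm_model(
    #     pretrained_model_name="bert-base-uncased",
    #     config_file="bert_config.json",
    # )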