NeMo NLP collection API

Model Classes

Modules

class nemo.collections.nlp.modules.BertModule(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.module.NeuralModule, nemo.core.classes.exportable.Exportable

input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

restore_weights(restore_path: str)[source]

Restores module/model’s weights

class nemo.collections.nlp.modules.common.megatron.MegatronBertEncoder(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.bert_module.BertModule

MegatronBERT wraps around the Megatron Language model from https://github.com/NVIDIA/Megatron-LM

Parameters
  • config_file (str) – path to model configuration file.

  • vocab_file (str) – path to vocabulary file.

  • tokenizer_type (str) – tokenizer type; currently only ‘BertWordPieceLowerCase’ is supported.

forward(input_ids, attention_mask, token_type_ids=None)[source]
property hidden_size

Property returning hidden size.

Returns

Hidden size.

restore_weights(restore_path: str)[source]
Restores module/model’s weights.

For model parallel checkpoints the directory structure should be restore_path/mp_rank_0X/model_optim_rng.pt

Parameters

restore_path (str) – restore_path should be a path to a file, or a directory if using model parallel

property vocab_size

Property returning vocab size.

Returns

vocab size.
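
A hedged sketch of restoring Megatron-LM weights into an encoder obtained through get_lm_model() (documented below). The model name, vocabulary file, and checkpoint path are placeholders, not values guaranteed to exist on your system:

    from nemo.collections.nlp.modules import get_lm_model

    # "megatron-bert-cased" and the file paths below are illustrative placeholders.
    megatron_bert = get_lm_model(
        pretrained_model_name="megatron-bert-cased",
        vocab_file="vocab.txt",
    )

    # For model parallel checkpoints, pass a directory laid out as
    # <restore_path>/mp_rank_0X/model_optim_rng.pt; otherwise pass a checkpoint file.
    megatron_bert.restore_weights("megatron_checkpoints")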

class nemo.collections.nlp.modules.AlbertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.AlbertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps around the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids)[source]
class nemo.collections.nlp.modules.BertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.BertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps around the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask=None, token_type_ids=None)[source]
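
As an illustration, a hedged sketch of a forward pass through a BertEncoder obtained from get_lm_model() (documented below). Tensor shapes follow the [Batch, Time] convention used elsewhere on this page; the token id values are arbitrary and the exact output shape is an assumption:

    import torch

    from nemo.collections.nlp.modules import get_lm_model

    encoder = get_lm_model(pretrained_model_name="bert-base-uncased")

    # Dummy input of shape [Batch, Time]; ids are arbitrary and only for shape checking.
    input_ids = torch.randint(0, 100, (2, 16))
    attention_mask = torch.ones_like(input_ids)
    token_type_ids = torch.zeros_like(input_ids)

    hidden_states = encoder(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
    )
    print(hidden_states.shape)  # expected [2, 16, hidden_size]
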
class nemo.collections.nlp.modules.DistilBertEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.DistilBertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps around the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids=None)[source]
class nemo.collections.nlp.modules.RobertaEncoder(*args: Any, **kwargs: Any)[source]

Bases: transformers.RobertaModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps around the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, token_type_ids, attention_mask)[source]
class nemo.collections.nlp.modules.SequenceClassifier(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

forward(hidden_states)[source]
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks
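
A hedged usage sketch for SequenceClassifier. Only forward(hidden_states) is documented on this page, so the constructor arguments hidden_size and num_classes below are assumptions based on common usage:

    import torch

    from nemo.collections.nlp.modules import SequenceClassifier

    # hidden_size and num_classes are assumed constructor arguments (not documented above).
    classifier = SequenceClassifier(hidden_size=768, num_classes=3)

    hidden_states = torch.randn(4, 16, 768)  # [Batch, Time, hidden_size], e.g. BERT output
    logits = classifier(hidden_states=hidden_states)
    print(logits.shape)  # expected [4, 3], one logit vector per sequence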

class nemo.collections.nlp.modules.SequenceRegression(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

Parameters
  • hidden_size – the hidden size of the mlp head on the top of the encoder

  • num_layers – number of the linear layers of the mlp head on the top of the encoder

  • activation – type of activations between layers of the mlp head

  • dropout – the dropout used for the mlp head

  • use_transformer_init – initializes the weights with the same approach used in Transformer

  • idx_conditioned_on – index of the token to use as the sequence representation for the classification task, default is the first token

forward(hidden_states: torch.Tensor) torch.Tensor[source]

Forward pass through the module.

Parameters

hidden_states – hidden states for each token in a sequence, for example, BERT module output

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks
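
A hedged sketch of using SequenceRegression with the parameters documented above; the concrete values passed below (and the exact output shape) are assumptions:

    import torch

    from nemo.collections.nlp.modules import SequenceRegression

    # hidden_size must match the encoder; the other arguments mirror the documented parameters.
    regressor = SequenceRegression(
        hidden_size=768,
        num_layers=2,
        activation="relu",
        dropout=0.1,
        idx_conditioned_on=0,  # condition on the first token ([CLS] for BERT)
    )

    hidden_states = torch.randn(4, 16, 768)  # [Batch, Time, hidden_size], e.g. BERT output
    preds = regressor(hidden_states=hidden_states)
    print(preds.shape)  # expected [4], one regression value per sequence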

class nemo.collections.nlp.modules.SequenceTokenClassifier(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

forward(hidden_states)[source]
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

nemo.collections.nlp.modules.get_lm_model(pretrained_model_name: str, config_dict: Optional[dict] = None, config_file: Optional[str] = None, checkpoint_file: Optional[str] = None, vocab_file: Optional[str] = None) nemo.collections.nlp.modules.common.bert_module.BertModule[source]

Helper function to instantiate a language model encoder, either from scratch or from a pretrained checkpoint. If only pretrained_model_name is passed, a pretrained model is returned. If a configuration is passed, whether as a file or a dictionary, the model is initialized with random weights.

Parameters
  • pretrained_model_name – pretrained model name, for example, bert-base-uncased or megatron-bert-cased. See get_pretrained_lm_models_list() for full list.

  • config_dict – the model configuration as a dictionary

  • config_file – path to the model configuration file

  • checkpoint_file – path to the pretrained model checkpoint

  • vocab_file – path to the vocabulary file to be used with Megatron-LM

Returns

Pretrained BertModule
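
A minimal usage sketch combining get_lm_model() with get_pretrained_lm_models_list(); the model name is one of the examples mentioned above:

    from nemo.collections.nlp.modules import get_lm_model, get_pretrained_lm_models_list

    # Inspect supported pretrained model names (pass include_external=True to list all
    # HuggingFace names, not only the language models supported in NeMo).
    print(get_pretrained_lm_models_list()[:5])

    # Instantiate a pretrained encoder; the returned object is a BertModule subclass.
    bert = get_lm_model(pretrained_model_name="bert-base-uncased")
    print(type(bert).__name__)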

nemo.collections.nlp.modules.get_pretrained_lm_models_list(include_external: bool = False) List[str][source]

Returns the list of supported pretrained model names

Parameters
  • include_external (bool) – if True, returns all HuggingFace model names, not only the language models supported in NeMo.

nemo.collections.nlp.modules.common.megatron.get_megatron_lm_models_list() List[str][source]

Returns the list of supported Megatron-LM models

Datasets

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

A dataset to use during training for punctuation and capitalization tasks. For inference, you will need BertPunctuationCapitalizationInferDataset. For huge datasets which cannot be loaded into memory all at once, use BertPunctuationCapitalizationTarredDataset. A construction sketch follows the parameter list below.

Parameters
  • text_file (Union[str, os.PathLike]) – a path to a file with sequences, each line should contain a text without punctuation and capitalization

  • labels_file (Union[str, os.PathLike]) – a path to a file with labels, each line corresponds to word labels for a sentence in the text_file. Labels have to follow format described in this section of documentation NeMo Data Format.

  • max_seq_length (int) – max number of tokens in a source sequence. max_seq_length includes the [CLS] and [SEP] tokens. Sequences which are too long will be clipped by removing tokens from the end of the sequence.

  • tokenizer (TokenizerSpec) – a tokenizer instance which has properties unk_id, sep_id, bos_id, eos_id.

  • num_samples (int, optional, defaults to -1) – the number of samples to use from the dataset. If -1, the whole dataset is used. Useful for testing.

  • tokens_in_batch (int, optional, defaults to 5000) – number of tokens in a batch including paddings and special tokens ([CLS], [SEP], [UNK]). This class's __getitem__() method returns ready batches rather than individual samples. The number of samples in a batch is adjusted to the lengths of the input sequences: if input sequences are short, a batch will contain more samples. Before packing into batches, samples are sorted by the number of tokens they contain. Sorting makes it possible to significantly reduce the number of pad tokens in a batch. Regular PyTorch data loader shuffling will only permute batches without changing their content. Proper shuffling is achieved by calling the method repack_batches_with_shuffle() every epoch.

  • pad_label (str, optional, defaults to 'O') – pad value to use for labels. It’s also the neutral label both for punctuation and capitalization.

  • punct_label_ids (Dict[str, int], optional) – dict to map punctuation labels to label ids. For dev set, use label ids generated during training to support cases when not all labels are present in the dev set. For training, it is recommended to set punct_label_ids to None or load from cache.

  • capit_label_ids (Dict[str, int], optional) – same as punct_label_ids, but for capitalization labels.

  • ignore_extra_tokens (bool, optional, defaults to False) – whether to compute loss on tokens which are not first tokens in a word. For example, assume that word 'tokenization' is tokenized into ['token', 'ization']. If ignore_extra_tokens=True, loss mask for the word is [True, False], and if ignore_extra_tokens=False, then loss mask is [True, True].

  • ignore_start_end (bool, optional, defaults to True) – whether to ignore [CLS] and [SEP] tokens in the loss_mask.

  • use_cache (bool, optional, defaults to True) –

    whether to use pickled features or not. If pickled features do not exist and use_cache=True, then pickled features will be created. Pickled features are looked for and stored in cache_dir. Pickled features include input ids, subtokens mask (mask of first tokens in words), encoded punctuation and capitalization labels, and label ids. Feature creation consumes considerable time, so use_cache=True significantly speeds up training startup.

    Warning

    If you spawned more than 1 process BEFORE dataset creation, then the use_cache parameter has to be True. In PyTorch Lightning, spawning is performed when Trainer.fit() or Trainer.test() is called.

  • cache_dir (Union[str, os.PathLike], optional) – a path to a directory where the cache (pickled features) is stored. By default, the text_file parent directory is used. This parameter is useful if the dataset directory is read-only and you wish to pickle features. In such a case, pass a writable directory in the cache_dir parameter.

  • get_label_frequencies (bool, optional, defaults to False) – whether to print and save label frequencies. Frequencies are shown if the verbose parameter is True. If get_label_frequencies=True, then frequencies are saved into the label_info_save_dir directory.

  • label_info_save_dir (Union[str, os.PathLike], optional) – a path to a directory where label frequencies are saved. By default the text_file parent directory is used. When the method save_labels_and_get_file_paths() is called, label ids are saved into the label_info_save_dir directory. This parameter, like cache_dir, is useful if the directory containing text_file is read-only.

  • punct_label_vocab_file (Union[str, os.PathLike], optional) – a path to a .csv file containing the punctuation label vocabulary. Each line in such a vocabulary file contains exactly one label. The first line has to contain pad_label, otherwise an error will be raised.

  • capit_label_vocab_file (Union[str, os.PathLike], optional) – same as punct_label_vocab_file for capitalization labels.

  • add_masks_and_segment_ids_to_batch (bool, optional, defaults to True) – whether to add 'loss_mask', 'input_mask', 'segment_ids' items to a batch. Useful for creation of tarred dataset and can NOT be used during model training and inference.

  • verbose (bool, optional, defaults to True) – whether to show data examples, label stats and other useful information.

  • n_jobs (int, optional, defaults to 0) –

    number of workers used for tokenization, encoding labels, creating “first token in word” mask, and clipping. If n_jobs <= 0 data preparation is performed without multiprocessing. By default n_jobs is equal to the number of CPUs.

    Warning

    There can be deadlocking problems with some tokenizers (e.g. SentencePiece, HuggingFace AlBERT) if n_jobs > 0.

  • tokenization_progress_queue (multiprocessing.Queue, optional) – a queue for reporting tokenization progress. Useful for creation of tarred dataset

  • batch_mark_up_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in deciding which samples batches will contain. Useful for creation of tarred dataset

  • batch_building_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in batch creation (stacking and padding). Useful for creation of tarred dataset
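
A hedged construction sketch for this dataset. The file paths are placeholders, and the tokenizer helper used below (nemo.collections.common.tokenizers.AutoTokenizer) is an assumption about the wider NeMo API rather than something documented on this page; any TokenizerSpec with the required properties would do. Note the batch_size=1 requirement explained under collate_fn() below:

    import torch

    from nemo.collections.common.tokenizers import AutoTokenizer  # assumed tokenizer helper
    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset import (
        BertPunctuationCapitalizationDataset,
    )

    tokenizer = AutoTokenizer("bert-base-uncased")

    dataset = BertPunctuationCapitalizationDataset(
        text_file="train_text.txt",      # one unpunctuated, lower-cased sentence per line
        labels_file="train_labels.txt",  # word-level labels in NeMo Data Format
        max_seq_length=128,
        tokenizer=tokenizer,
        tokens_in_batch=5000,
    )

    # __getitem__() already returns packed batches, so the loader batch size must be 1.
    loader = torch.utils.data.DataLoader(dataset, batch_size=1, collate_fn=dataset.collate_fn)

    for epoch in range(3):
        for batch in loader:
            pass  # training step goes here
        dataset.repack_batches_with_shuffle()  # proper shuffling between epochs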

__getitem__(idx: int) Dict[str, ArrayLike][source]

Return a batch with index idx. The values of a batch dictionary are numpy arrays of identical shapes [Batch, Time]. Labels are identical for all tokens in a word. For example, if

  • word 'Tokenization' is tokenized into tokens ['token', 'ization'],

  • it is followed by a comma,

then punctuation labels are [',', ','] and capitalization labels are ['U', 'U'] ('U' is the label for words which start with an upper case character).

Parameters

idx – an index of returned batch

Returns

a dictionary with items:

  • 'input_ids' (numpy.ndarray): numpy.int32 array containing encoded tokens,

  • 'subtokens_mask' (numpy.ndarray): bool array whose elements are True if they correspond to the first token in a word,

  • 'punct_labels' (numpy.ndarray): numpy.int32 array containing encoded punctuation labels,

  • 'capit_labels' (numpy.ndarray): numpy.int32 array containing encoded capitalization labels.

  • 'segment_ids' (numpy.ndarray): numpy.int8 array filled with zeros (BERT token types in HuggingFace terminology) (if self.add_masks_and_segment_ids_to_batch is False, then this item is missing),

  • 'input_mask' (numpy.ndarray): bool array whose elements are True if the corresponding token is not a padding token (if self.add_masks_and_segment_ids_to_batch is False, then this item is missing),

  • 'loss_mask' (numpy.ndarray): bool array whose elements are True if loss is computed for the corresponding token. See more in the description of the constructor parameters ignore_start_end and ignore_extra_tokens (if self.add_masks_and_segment_ids_to_batch is False, then this item is missing).

Return type

Dict[str, ArrayLike]

collate_fn(batches: List[Dict[str, ArrayLike]]) Dict[str, torch.Tensor][source]

Returns the zeroth batch from the batches list passed for collating and casts 'segment_ids', 'punct_labels', 'capit_labels' to types supported by PunctuationCapitalizationModel. All output tensors have shape [Batch, Time].

Warning

The batch_size parameter of the PyTorch data loader and sampler has to be 1.

Parameters

batches (List[Dict[str, ArrayLike]]) – a list containing 1 batch passed for collating

Returns

a batch dictionary with the following items (for a detailed description of batch items, see the __getitem__() method).

Return type

Dict[str, torch.Tensor]

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.

repack_batches_with_shuffle() None[source]

A function for proper shuffling of a dataset. PyTorch data loader shuffling will only permute batches.

save_labels_and_get_file_paths(punct_labels_file_name: str, capit_labels_file_name: str) Tuple[pathlib.Path, pathlib.Path][source]

Saves label ids into files located in self.label_info_save_dir. Saved label ids are usually used for .nemo checkpoint creation.

The signature of this method must be identical to the signature of the corresponding save_labels_and_get_file_paths() method of BertPunctuationCapitalizationTarredDataset.

Parameters
  • punct_labels_file_name (str) – a name of a punctuation labels file

  • capit_labels_file_name (str) – a name of a capitalization labels file

Returns

a tuple containing:

  • pathlib.Path: a path to the saved punctuation labels file

  • pathlib.Path: a path to the saved capitalization labels file

Return type

Tuple[pathlib.Path, pathlib.Path]

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset.BertPunctuationCapitalizationInferDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

Creates a dataset to use during inference for punctuation and capitalization tasks with a pretrained model. For datasets to use during training with labels, see BertPunctuationCapitalizationDataset and BertPunctuationCapitalizationTarredDataset.

The parameters max_seq_length, step, and margin control the way queries are split into segments which are then processed by the model. Parameter max_seq_length is the length of a segment after tokenization, including the special tokens [CLS] at the beginning and [SEP] at the end of a segment. Parameter step is the shift between consecutive segments. Parameter margin is used to exclude the negative effect of subtokens near segment borders, which have context on one side only. A construction sketch follows the parameter list below.

Parameters
  • queries (List[str]) – list of sequences.

  • tokenizer (TokenizerSpec) – a tokenizer which was used for model training. It should have properties cls_id, sep_id, unk_id, pad_id.

  • max_seq_length (int, optional, defaults to 128) – max sequence length which includes [CLS] and [SEP] tokens

  • step (int, optional, defaults to 8) – relative shift of consecutive segments into which long queries are split. Long queries are split into segments which can overlap. Parameter step controls such overlapping. Imagine that queries are tokenized into characters, max_seq_length=5, and step=2. In such a case, the query “hello” is tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']].

  • margin (int, optional, defaults to 16) – number of subtokens at the beginning and the end of segments which are not used for prediction computation. The first segment does not have a left margin and the last segment does not have a right margin. For example, if the input sequence is tokenized into characters, max_seq_length=5, step=1, and margin=1, then the query “hello” will be tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]. These segments are passed to the model. Before the final predictions are computed, margins are removed. In the next list, subtokens whose logits are not used for final prediction computation are marked with an asterisk: [['[CLS]'*, 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]'*]].
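
A hedged construction sketch for the inference dataset. The queries are placeholders, and the tokenizer helper (nemo.collections.common.tokenizers.AutoTokenizer) is an assumption about the wider NeMo API; in practice, the same tokenizer used for model training should be passed:

    import torch

    from nemo.collections.common.tokenizers import AutoTokenizer  # assumed tokenizer helper
    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset import (
        BertPunctuationCapitalizationInferDataset,
    )

    queries = ["how are you", "nemo supports punctuation and capitalization restoration"]
    tokenizer = AutoTokenizer("bert-base-uncased")

    infer_dataset = BertPunctuationCapitalizationInferDataset(
        queries=queries,
        tokenizer=tokenizer,
        max_seq_length=128,
        step=8,
        margin=16,
    )

    # A regular PyTorch DataLoader can be used with this dataset's collate_fn (see below).
    loader = torch.utils.data.DataLoader(
        infer_dataset, batch_size=32, collate_fn=infer_dataset.collate_fn
    )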

__getitem__(idx: int) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool][source]

Returns batch used for punctuation and capitalization inference.

Parameters

idx (int) – a batch index

Returns

a tuple containing:

  • input_ids (np.ndarray): an integer numpy array of shape [Time]. Ids of word subtokens encoded using tokenizer passed in constructor tokenizer parameter.

  • segment_ids (np.ndarray): an integer zeros numpy array of shape [Time]. Indices of segments for BERT model (token types in HuggingFace terminology).

  • input_mask (np.ndarray): a boolean numpy array of shape [Time]. An element of this array is True if the corresponding token is not a padding token.

  • subtokens_mask (np.ndarray): a boolean numpy array of shape [Time]. An element equals True if the corresponding token is the first token in a word and False otherwise. For example, if the input query "language processing" is tokenized into ["[CLS]", "language", "process", "ing", "[SEP]"], then subtokens_mask will be [False, True, True, False, False].

  • quantities_of_preceding_words (int): the number of words preceding the current segment in the query to which the segment belongs. This parameter is used for uniting predictions from adjacent segments.

  • query_ids (int): an index of the query to which the segment belongs

  • is_first (bool): whether a segment is the first segment in a query. The left margin of the first segment in a query is not removed.

  • is_last (bool): whether a segment is the last segment in a query. The right margin of the last segment in a query is not removed.

Return type

Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, int, int, bool, bool]

collate_fn(batch: List[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool]]) Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Tuple[int, ...], Tuple[int, ...], Tuple[bool, ...], Tuple[bool, ...]][source]

Collates samples into batches.

Parameters

batch (List[tuple]) – a list of samples returned by __getitem__() method.

Returns

a tuple containing 8 elements:

  • input_ids (torch.Tensor): an integer tensor of shape [Batch, Time] containing encoded input text.

  • segment_ids (torch.Tensor): an integer tensor of shape [Batch, Time] filled with zeros.

  • input_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is not a padding token.

  • subtokens_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is the first token in a word.

  • quantities_of_preceding_words (Tuple[int, ...]): a tuple containing, for each segment, the number of words preceding it in its query.

  • query_ids (Tuple[int, ...]): a tuple containing indices of the queries to which segments belong.

  • is_first (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the first segment in a query.

  • is_last (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the last segment in a query.

Return type

Tuple[torch.Tensor (x4), Tuple[int, ...] (x2), Tuple[bool, ...] (x2)]

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns neural types of collate_fn() output.