NeMo NLP collection API#

Model Classes#

Modules#

class nemo.collections.nlp.modules.BertModule(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.module.NeuralModule, nemo.core.classes.exportable.Exportable

input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing, export, etc. Returns a tuple of input examples.

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable input neural type checks

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable output neural type checks

restore_weights(restore_path: str)[source]#

Restores module/model’s weights
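A minimal sketch of how input_example() pairs with the Exportable base class for tracing and export. It assumes the HuggingFace-style from_pretrained constructor is usable on the wrapper classes documented below; the model name and the output file name are illustrative.

    from nemo.collections.nlp.modules import BertEncoder

    # Build a wrapped HuggingFace BERT encoder (assumption: from_pretrained works on the NeMo wrapper).
    encoder = BertEncoder.from_pretrained("bert-base-uncased")

    # input_example() returns a tuple of tensors sized for tracing (batch <= max_batch, length <= max_dim).
    example = encoder.input_example(max_batch=2, max_dim=64)

    # export() comes from nemo.core.classes.exportable.Exportable; the ONNX file name is hypothetical.
    encoder.export("bert_encoder.onnx", input_example=example)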

class nemo.collections.nlp.modules.AlbertEncoder(*args: Any, **kwargs: Any)[source]#

Bases: transformers.AlbertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids)[source]#
class nemo.collections.nlp.modules.BertEncoder(*args: Any, **kwargs: Any)[source]#

Bases: transformers.BertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask=None, token_type_ids=None)[source]#
class nemo.collections.nlp.modules.DistilBertEncoder(*args: Any, **kwargs: Any)[source]#

Bases: transformers.DistilBertModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids=None)[source]#
class nemo.collections.nlp.modules.RobertaEncoder(*args: Any, **kwargs: Any)[source]#

Bases: transformers.RobertaModel, nemo.collections.nlp.modules.common.bert_module.BertModule

Wraps the HuggingFace transformers implementation for easy use within NeMo.

forward(input_ids, attention_mask, token_type_ids)[source]#
class nemo.collections.nlp.modules.SequenceClassifier(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

forward(hidden_states)[source]#
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable output neural type checks

class nemo.collections.nlp.modules.SequenceRegression(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

Parameters
  • hidden_size – the hidden size of the mlp head on the top of the encoder

  • num_layers – number of the linear layers of the mlp head on the top of the encoder

  • activation – type of activations between layers of the mlp head

  • dropout – the dropout used for the mlp head

  • use_transformer_init – initializes the weights with the same approach used in Transformer

  • idx_conditioned_on – index of the token to use as the sequence representation for the classification task, default is the first token

forward(hidden_states: torch.Tensor) torch.Tensor[source]#

Forward pass through the module.

Parameters

hidden_states – hidden states for each token in a sequence, for example, BERT module output
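A minimal usage sketch of the regression head applied to encoder hidden states. The constructor keyword names follow the parameter list above, and the sizes are illustrative.

    import torch

    from nemo.collections.nlp.modules import SequenceRegression

    # The head's hidden_size must match the encoder output size (768 here is illustrative).
    head = SequenceRegression(hidden_size=768, num_layers=2, dropout=0.1)

    hidden_states = torch.randn(4, 16, 768)    # e.g. BERT module output: [batch, seq_len, hidden]
    preds = head(hidden_states=hidden_states)  # one regression value per sequence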

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable output neural type checks

class nemo.collections.nlp.modules.SequenceTokenClassifier(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.nlp.modules.common.classifier.Classifier

forward(hidden_states)[source]#
property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable output neural type checks

nemo.collections.nlp.modules.get_lm_model(config_dict: Optional[dict] = None, config_file: Optional[str] = None, vocab_file: Optional[str] = None, trainer: Optional[pytorch_lightning.Trainer] = None, cfg: Optional[omegaconf.DictConfig] = None) nemo.collections.nlp.modules.common.bert_module.BertModule[source]#

Helper function to instantiate a language model encoder, either from scratch or from a pretrained model. If only a pretrained model name is passed, a pretrained model is returned. If a configuration is passed, whether as a file or a dictionary, the model is initialized with random weights.

Parameters
  • config_dict – the model configuration dictionary

  • config_file – path to the model configuration file

  • vocab_file – path to the vocabulary file to be used with Megatron-LM

  • trainer – an instance of a PyTorch Lightning trainer

  • cfg – a model configuration

Returns

Pretrained BertModule
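A minimal sketch of instantiating an encoder through this helper and running a forward pass. The cfg layout below (a language_model.pretrained_model_name key) is an assumption based on typical NeMo NLP model configs; consult your model's YAML configuration for the exact schema.

    import torch
    from omegaconf import OmegaConf

    from nemo.collections.nlp.modules import get_lm_model

    # Assumed config layout; the pretrained model name is illustrative.
    cfg = OmegaConf.create(
        {"language_model": {"pretrained_model_name": "bert-base-uncased", "config_file": None, "config": None}}
    )
    encoder = get_lm_model(cfg=cfg)  # returns a BertModule subclass, e.g. BertEncoder

    input_ids = torch.randint(0, 1000, (2, 16))
    attention_mask = torch.ones_like(input_ids)
    token_type_ids = torch.zeros_like(input_ids)

    # Hidden states for every token, shape [batch, seq_len, hidden_size].
    hidden_states = encoder(
        input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids
    )
    print(hidden_states.shape)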

nemo.collections.nlp.modules.get_pretrained_lm_models_list(include_external: bool = False) List[str][source]#

Returns the list of supported pretrained model names

Parameters
  • include_external (bool, optional, defaults to False) – if True, includes all HuggingFace model names, not only the language models supported in NeMo.
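A short usage sketch; the returned names depend on the installed NeMo and transformers versions.

    from nemo.collections.nlp.modules import get_pretrained_lm_models_list

    nemo_supported = get_pretrained_lm_models_list()                  # language models supported in NeMo
    all_names = get_pretrained_lm_models_list(include_external=True)  # plus all HuggingFace model names
    print(len(nemo_supported), len(all_names))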

nemo.collections.nlp.modules.common.megatron.get_megatron_lm_models_list() List[str][source]#

Returns the list of supported Megatron-LM models

Datasets#

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset

A dataset to use during training for punctuation and capitalization tasks. For inference, you will need BertPunctuationCapitalizationInferDataset. For huge datasets which cannot be loaded into memory simultaneously, use BertPunctuationCapitalizationTarredDataset.

Parameters
  • text_file (Union[str, os.PathLike]) – a path to a file with sequences, each line should contain a text without punctuation and capitalization

  • labels_file (Union[str, os.PathLike]) – a path to a file with labels, each line corresponds to word labels for a sentence in the text_file. Labels have to follow format described in this section of documentation NeMo Data Format.

  • max_seq_length (int) – max number of tokens in a source sequence. max_seq_length includes the [CLS] and [SEP] tokens. Sequences which are too long will be clipped by removing tokens from the end of the sequence.

  • tokenizer (TokenizerSpec) – a tokenizer instance which has properties unk_id, sep_id, bos_id, eos_id.

  • num_samples (int, optional, defaults to -1) – a number of samples you want to use for the dataset. If -1, use all dataset. Useful for testing.

  • tokens_in_batch (int, optional, defaults to 5000) – number of tokens in a batch, including padding and special tokens ([CLS], [SEP], [UNK]). The __getitem__() method of this class returns ready batches rather than individual samples. The number of samples in a batch is adjusted to the lengths of the input sequences: if input sequences are short, then a batch will contain more samples. Before packing into batches, samples are sorted by the number of tokens they contain. Sorting significantly reduces the number of pad tokens in a batch. Regular PyTorch data loader shuffling will only permute batches without changing their content. Proper shuffling is achieved by calling the method repack_batches_with_shuffle() every epoch. If the parameter number_of_batches_is_multiple_of is greater than 1, some batches may be split into smaller pieces.

  • pad_label (str, optional, defaults to 'O') – pad value to use for labels. It’s also the neutral label both for punctuation and capitalization.

  • punct_label_ids (Dict[str, int], optional) – dict to map punctuation labels to label ids. For dev set, use label ids generated during training to support cases when not all labels are present in the dev set. For training, it is recommended to set punct_label_ids to None or load from cache.

  • capit_label_ids (Dict[str, int], optional) – same as punct_label_ids but for capitalization labels.

  • ignore_extra_tokens (bool, optional, defaults to False) – whether to ignore tokens which are not first tokens in a word when computing loss. For example, assume that the word 'tokenization' is tokenized into ['token', 'ization']. If ignore_extra_tokens=True, the loss mask for the word is [True, False], and if ignore_extra_tokens=False, then the loss mask is [True, True].

  • ignore_start_end (bool, optional, defaults to True) – whether to ignore [CLS] and [SEP] tokens in the loss_mask.

  • use_cache (bool, optional, defaults to True) – whether to use pickled features already present in cache_dir. If the pickled features file does not exist or use_cache=False, then features are pickled in cache_dir. Pickled features include input ids, subtokens mask (mask of first tokens in words), encoded punctuation and capitalization labels, and label ids. Feature creation consumes considerable time, so use_cache=True significantly speeds up the start of training. Pickled features are also used for sharing features between processes if data parallel training is used.

  • cache_dir (Union[str, os.PathLike], optional) – a path to a directory where the cache (pickled features) is stored. By default, the text_file parent directory is used. This parameter is useful if the dataset directory is read-only and you wish to pickle features. In such a case, pass a path to a writable directory in the cache_dir parameter.

  • get_label_frequencies (bool, optional, defaults to False) – whether to print and save label frequencies. Frequencies are shown if the verbose parameter is True. If get_label_frequencies=True, then frequencies are saved into the label_info_save_dir directory.

  • label_info_save_dir (Union[str, os.PathLike], optional) – a path to a directory where label frequencies are saved. By default, the text_file parent directory is used. When the method save_labels_and_get_file_paths() is called, label ids are saved into the label_info_save_dir directory. This parameter is useful if the directory containing text_file is read-only.

  • punct_label_vocab_file (Union[str, os.PathLike], optional) – a path to a .csv file containing the punctuation label vocabulary. Each line in such a vocabulary file contains exactly one label. The first line has to contain pad_label, otherwise an error is raised.

  • capit_label_vocab_file (Union[str, os.PathLike], optional) – same as punct_label_vocab_file for capitalization labels.

  • add_masks_and_segment_ids_to_batch (bool, optional, defaults to True) – whether to add 'loss_mask', 'input_mask', 'segment_ids' items to a batch. Useful for creation of tarred dataset and can NOT be used during model training and inference.

  • verbose (bool, optional, defaults to True) – whether to show data examples, label stats and other useful information.

  • n_jobs (int, optional, defaults to 0) –

    number of workers used for tokenization, encoding labels, creating “first token in word” mask, and clipping. If n_jobs <= 0 data preparation is performed without multiprocessing. By default n_jobs is 0.

    Warning

    There can be deadlocking problems with some tokenizers (e.g. SentencePiece, HuggingFace AlBERT) if n_jobs > 0.

  • number_of_batches_is_multiple_of (int, optional, defaults to 1) – the number of batches in the dataset is made divisible by number_of_batches_is_multiple_of. If number_of_batches_is_multiple_of is greater than 1, then several batches are split in parts until the number of batches is divisible by number_of_batches_is_multiple_of. If there are not enough queries in the dataset to create enough batches, a warning is printed. This parameter is useful for dev and validation datasets if multiple GPUs are used. The problem is that if the number of batches is not evenly divisible by the number of GPUs, then some queries may be processed several times and metrics will be distorted.

  • batch_shuffling_random_seed (int, optional) – a random seed used for batch repacking and shuffling.

  • tokenization_progress_queue (multiprocessing.Queue, optional) – a queue for reporting tokenization progress. Useful for creation of a tarred dataset.

  • batch_mark_up_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in deciding which samples the batches will contain. Useful for creation of a tarred dataset.

  • batch_building_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in batch creation (stacking and padding). Useful for creation of a tarred dataset.

__getitem__(idx: int) Dict[str, numpy.ndarray][source]#

Return a batch with index idx. The values of a batch dictionary are numpy arrays of identical shapes [Batch, Time]. Labels are identical for all tokens in a word. For example, if

  • word 'Tokenization' is tokenized into tokens ['token', 'ization'],

  • it is followed by a comma,

then punctuation labels are [',', ','] and capitalization labels are ['U', 'U'] ('U' is the label for words which start with an uppercase character).

Parameters

idx – an index of returned batch

Returns

a dictionary with items:

  • 'input_ids' (numpy.ndarray): numpy.int32 array containing encoded tokens,

  • 'subtokens_mask' (numpy.ndarray): bool array whose elements are True if they correspond to the first token in a word,

  • 'punct_labels' (numpy.ndarray): numpy.int32 array containing encoded punctuation labels,

  • 'capit_labels' (numpy.ndarray): numpy.int32 array containing encoded capitalization labels.

  • 'segment_ids' (numpy.ndarray): numpy.int8 array filled with zeros (BERT token types in HuggingFace terminology) (if self.add_masks_and_segment_ids_to_batch is False, then this item is missing),

  • 'input_mask' (numpy.ndarray): bool array whose elements are True if the corresponding token is not a padding token (if self.add_masks_and_segment_ids_to_batch is False, then this item is missing),

  • 'loss_mask' (numpy.ndarray): bool array whose elements are True if loss is computed for the corresponding token. See more in the description of constructor parameters ignore_start_end, ignore_extra_tokens (if self.add_masks_and_segment_ids_to_batch is False, then this item is missing).

Return type

Dict[str, np.ndarray]

static calc_batch_seq_length(queries: List[numpy.ndarray], length_is_multiple_of: int) int[source]#
collate_fn(batches: List[Dict[str, numpy.ndarray]]) Dict[str, torch.Tensor][source]#

Returns the zeroth batch from the batches list passed for collating and casts 'segment_ids', 'punct_labels', 'capit_labels' to types supported by PunctuationCapitalizationModel. All output tensors have shape [Batch, Time].

Warning

The batch_size parameter of the PyTorch data loader and sampler has to be 1.

Parameters

batches (List[Dict[str, np.ndarray]]) – a list containing 1 batch passed for collating

Returns

a batch dictionary with the items described in the __getitem__() method

Return type

Dict[str, torch.Tensor]
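A minimal sketch of wiring this dataset into a PyTorch DataLoader. The file paths and tokenizer name are illustrative. Because __getitem__() already returns ready batches, the loader's batch_size must be 1, and repack_batches_with_shuffle() is called at the start of every epoch for proper shuffling.

    import torch

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset import (
        BertPunctuationCapitalizationDataset,
    )
    from nemo.collections.nlp.modules.common import get_tokenizer

    tokenizer = get_tokenizer(tokenizer_name="bert-base-uncased")  # illustrative tokenizer name
    dataset = BertPunctuationCapitalizationDataset(
        text_file="train_text.txt",      # hypothetical path: text without punctuation and capitalization
        labels_file="train_labels.txt",  # hypothetical path: labels in the NeMo Data Format
        max_seq_length=128,
        tokenizer=tokenizer,
        tokens_in_batch=2048,
    )
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=1, shuffle=False, collate_fn=dataset.collate_fn
    )
    for epoch in range(3):                     # the number of epochs is illustrative
        dataset.repack_batches_with_shuffle()  # re-mix samples across batches every epoch
        for batch in loader:
            print(batch["input_ids"].shape)    # [Batch, Time]
            break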

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Returns definitions of module output ports.

repack_batches_with_shuffle() None[source]#

A method for proper shuffling of the dataset. PyTorch data loader shuffling will only permute ready-made batches without changing their content.

save_labels_and_get_file_paths(punct_labels_file_name: str, capit_labels_file_name: str) Tuple[pathlib.Path, pathlib.Path][source]#

Saves label ids into files located in self.label_info_save_dir. Saved label ids are usually used for .nemo checkpoint creation.

The signature of this method must be identical to the signature of the method BertPunctuationCapitalizationTarredDataset.save_labels_and_get_file_paths().

Parameters
  • punct_labels_file_name (str) – a name of a punctuation labels file

  • capit_labels_file_name (str) – a name of a capitalization labels file

Returns

a tuple containing:

  • pathlib.Path: a path to the saved punctuation labels file

  • pathlib.Path: a path to the saved capitalization labels file

Return type

Tuple[pathlib.Path, pathlib.Path]

nemo.collections.nlp.data.token_classification.punctuation_capitalization_tarred_dataset.create_tarred_dataset(text_file: Union[os.PathLike, str], labels_file: Union[os.PathLike, str], output_dir: Union[os.PathLike, str], max_seq_length: int, tokens_in_batch: int, lines_per_dataset_fragment: int, num_batches_per_tarfile: int, tokenizer_name: str, tokenizer_model: Optional[Union[str, os.PathLike]] = None, vocab_file: Optional[Union[str, os.PathLike]] = None, merges_file: Optional[Union[str, os.PathLike]] = None, special_tokens: Optional[Dict[str, str]] = None, use_fast_tokenizer: Optional[bool] = False, pad_label: str = 'O', punct_label_ids: Optional[Dict[str, int]] = None, capit_label_ids: Optional[Dict[str, int]] = None, punct_label_vocab_file: Optional[Union[str, os.PathLike]] = None, capit_label_vocab_file: Optional[Union[str, os.PathLike]] = None, tar_file_prefix: Optional[str] = 'punctuation_capitalization', n_jobs: Optional[int] = None) None[source]#

Creates a tarred dataset from text_file and labels_file. A tarred dataset allows training on large amounts of data without storing it all in memory simultaneously. You may use this function directly or try the script examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py.

A tarred dataset is a directory which contains a metadata file, tar files with batches, and punct_label_vocab.csv and capit_label_vocab.csv files.

The metadata file is a JSON file with 4 items: 'num_batches', 'tar_files', 'punct_label_vocab_file', 'capit_label_vocab_file'. The item 'num_batches' (int) is the total number of batches in the tarred dataset. 'tar_files' is a list of paths to tar files relative to the directory containing the metadata file. The items 'punct_label_vocab_file' and 'capit_label_vocab_file' are paths to the punctuation and capitalization label vocabulary files, respectively. These paths are relative to the directory containing the metadata file.

Every tar file contains objects written using webdataset.TarWriter. Each object is a dictionary with two items: '__key__' and 'batch.pyd'. '__key__' is a name of a batch and 'batch.pyd' is a pickled dictionary which contains 'input_ids', 'subtokens_mask', 'punct_labels', 'capit_labels'. 'input_ids' is an array containing ids of source tokens, 'subtokens_mask' is a boolean array showing first tokens in words, 'punct_labels' and 'capit_labels' are arrays with ids of labels.

The metadata file should be passed to the constructor of BertPunctuationCapitalizationTarredDataset, and the instance of the class will handle iteration and constructing masks and token types for the BERT model.

Parameters
  • text_file (Union[os.PathLike, str]) – a path to a file with dataset source. Dataset source is lowercased text without punctuation. Number of lines in text_file has to be equal to the number of lines in labels_file.

  • labels_file (Union[os.PathLike, str]) – a path to a file with labels. Labels are given in the format described in NeMo Data Format.

  • output_dir (Union[os.PathLike, str]) – a path to a directory where metadata file, tar files and 'punct_label_ids.csv' and 'capit_label_ids.csv' files are saved.

  • max_seq_length (int) – maximum number of subtokens in an input sequence. A source sequence which contains too many subtokens is clipped to max_seq_length - 2 subtokens, and then a [CLS] token is prepended and a [SEP] token is appended to the clipped sequence. The clipping is performed by removing subtokens from the end of the source sequence.

  • tokens_in_batch (int) – maximum number of tokens in a batch, including [CLS], [SEP], [UNK], and [PAD] tokens. Before packing into batches, source sequences are sorted by the number of tokens in order to reduce the number of pad tokens, so the number of samples in a batch may vary.

  • lines_per_dataset_fragment (int) – the number of lines processed by one worker during creation of the tarred dataset. A worker tokenizes lines_per_dataset_fragment lines and keeps the tokenized text and labels in RAM before packing them into batches. Reducing lines_per_dataset_fragment reduces the amount of memory used by this function.

  • num_batches_per_tarfile (int) – the number of batches saved in a tar file. If you increase num_batches_per_tarfile, then there will be fewer tar files in the dataset. There cannot be fewer than num_batches_per_tarfile batches in a tar file, and all excess batches are removed. The maximum number of discarded batches is num_batches_per_tarfile - 1.

  • tokenizer_name (str) – a name of the tokenizer used for tokenization of source sequences. Possible options are 'sentencepiece', 'word', 'char', HuggingFace tokenizers. For more options see function nemo.collections.nlp.modules.common.get_tokenizer. The tokenizer must have properties cls_id, pad_id, sep_id, unk_id.

  • tokenizer_model (Union[os.PathLike, str], optional) – a path to a tokenizer model required for 'sentencepiece' tokenizer.

  • vocab_file (Union[os.PathLike, str], optional) – a path to a vocabulary file which can be used in 'word', 'char', and HuggingFace tokenizers.

  • merges_file (Union[os.PathLike, str], optional) – a path to merges file which can be used in HuggingFace tokenizers.

  • special_tokens (Dict[str, str], optional) – a dictionary with special tokens passed to constructors of 'char', 'word', 'sentencepiece', and various HuggingFace tokenizers.

  • use_fast_tokenizer (bool, optional, defaults to False) – whether to use fast HuggingFace tokenizer.

  • pad_label (str, optional, defaults to 'O') – a pad label both for punctuation and capitalization. This label is also a neutral label (used for marking words which do not need punctuation and capitalization).

  • punct_label_ids (Dict[str, int], optional) – a dictionary whose keys are punctuation labels and whose values are label ids. The pad label pad_label has to have id 0. You can provide at most one of the parameters punct_label_ids and punct_label_vocab_file. If neither punct_label_ids nor punct_label_vocab_file is provided, then punctuation label ids will be inferred from the labels_file file.

  • capit_label_ids (Dict[str, int], optional) – same as punct_label_ids for capitalization labels.

  • punct_label_vocab_file (Union[os.PathLike, str], optional) – a path to a file with punctuation labels. These labels include the pad label. The pad label has to be the first label in the file. Each label is written on a separate line. Alternatively you can use the punct_label_ids parameter. If neither punct_label_ids nor punct_label_vocab_file is provided, then punctuation label ids will be inferred from the labels_file file.

  • capit_label_vocab_file (Union[os.PathLike, str], optional) – same as punct_label_vocab_file for capitalization labels.

  • tar_file_prefix (str, optional, defaults to 'punctuation_capitalization') – a string with which tar file names start. The string can contain only the characters A-Z, a-z, 0-9, '_', '-', and '.'.

  • n_jobs (int, optional) – a number of workers for creating tarred dataset. If None, then n_jobs is equal to number of CPUs.
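A minimal invocation sketch; all paths, the tokenizer name, and the size settings below are illustrative.

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_tarred_dataset import (
        create_tarred_dataset,
    )

    create_tarred_dataset(
        text_file="train_text.txt",          # lowercased text without punctuation
        labels_file="train_labels.txt",      # labels in the NeMo Data Format
        output_dir="tarred_dataset",         # metadata file, tar files, and label vocabularies go here
        max_seq_length=128,
        tokens_in_batch=2048,
        lines_per_dataset_fragment=10000,
        num_batches_per_tarfile=100,
        tokenizer_name="bert-base-uncased",  # any tokenizer name supported by get_tokenizer
    )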

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_tarred_dataset.BertPunctuationCapitalizationTarredDataset(*args: Any, **kwargs: Any)[source]#

Bases: torch.utils.data.IterableDataset

A punctuation and capitalization dataset which does not require loading all data into memory simultaneously. A tarred dataset is created from text and label files using the script examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py or the function create_tarred_dataset().

Parameters
  • metadata_file (Union[os.PathLike, str]) –

    a path to the tarred dataset metadata file. The metadata file and the files referenced in it are created by examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py. The metadata file is a JSON file which contains 'num_batches', 'tar_files', 'punct_label_vocab_file', and 'capit_label_vocab_file' items. The first item is the total number of batches in the dataset, and the second is a list of paths to tar files relative to the directory containing metadata_file. The items 'punct_label_vocab_file' and 'capit_label_vocab_file' are paths to .csv files which contain unique punctuation and capitalization label vocabularies. Vocabulary file paths are relative to the directory containing the metadata_file. Each line in 'punct_label_vocab_file' and 'capit_label_vocab_file' contains 1 label. The first lines in the 'punct_label_vocab_file' and 'capit_label_vocab_file' files are neutral labels which also serve as pad labels. Neutral labels for punctuation and capitalization must be equal to the pad_label parameter.

  • tokenizer (TokenizerSpec) – a tokenizer instance used for tokenization of dataset source. A tokenizer instance is used for getting ids of [CLS], [PAD], and [SEP] tokens which are used for masks creation.

  • pad_label (str) – a label that is used for padding and for absence of punctuation or capitalization. Used for checking items 'punct_label_vocab' and 'capit_label_vocab' of dictionary in metadata_file.

  • label_info_save_dir (Union[os.PathLike, str], optional) – a path to a directory where label vocabularies are copied when method save_labels_and_get_file_paths() is called. This parameter is useful if tarred dataset directory is read-only.

  • ignore_extra_tokens (bool, optional, defaults to False) – whether to use only the first token in a word for loss computation and training. If set to True, loss will be computed only for the first tokens of words.

  • ignore_start_end (bool, optional, defaults to True) – whether to ignore [CLS] and [SEP] tokens in loss computation. If set to True, loss will not be computed for [CLS] and [SEP] tokens.

  • world_size (int, optional, defaults to 1) – the number of processes used for model training. It is used together with the global_rank parameter to decide which tar files will be used in the current process.

  • global_rank (int, optional, defaults to 0) – the rank of the current process in the pool of workers used for model training. It is used together with the world_size parameter to decide which tar files will be used in the current process.

  • shuffle_n (int, optional, defaults to 1) – a number of shuffled batches in a buffer. shuffle_n batches are loaded into memory, shuffled, and then yielded by a dataset instance.

  • shard_strategy (str, optional, defaults to 'scatter') –

    Tarred dataset shard distribution strategy chosen as a str value during ddp.

    • 'scatter': the default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.

    • 'replicate': optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on the value of shuffle_n.

      Warning

      The replicated strategy allows every node to sample the entire set of available tar files, and therefore more than one node may sample the same tar file, and even sample the same data points. As such, there is no guarantee that all samples in the dataset will be sampled at least once during 1 epoch. The scattered strategy, on the other hand, on specific occasions (when the number of shards is not divisible by world_size) will not sample the entire dataset. For these reasons it is not advisable to use tarred datasets as validation or test datasets.

__iter__() Iterator[Dict[str, numpy.ndarray]][source]#

Constructs an iterator of batches. The values of one batch dictionary are numpy arrays of identical shapes [Batch, Time].

Returns

an iterator of batches with items:

  • 'input_ids': np.int32 array containing encoded tokens,

  • 'subtokens_mask': bool array whose elements are True if they correspond to the first token in a word,

  • 'punct_labels': np.int32 array with encoded punctuation labels,

  • 'capit_labels': np.int32 array with encoded capitalization labels,

  • 'segment_ids': np.int8 array filled with zeros (BERT token types in HuggingFace terminology),

  • 'input_mask': bool array whose elements are True if the corresponding token is not a padding token,

  • 'loss_mask': bool array whose elements are True if loss is computed for the corresponding token. See more in the description of constructor parameters ignore_start_end, ignore_extra_tokens.

Return type

Iterator[Dict[str, np.ndarray]]
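A minimal sketch of iterating a tarred dataset through a PyTorch DataLoader. The metadata file path and tokenizer name are illustrative; batch_size must be 1 because batches are pre-built.

    import torch

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_tarred_dataset import (
        BertPunctuationCapitalizationTarredDataset,
    )
    from nemo.collections.nlp.modules.common import get_tokenizer

    dataset = BertPunctuationCapitalizationTarredDataset(
        metadata_file="tarred_dataset/metadata.json",                 # hypothetical metadata file name
        tokenizer=get_tokenizer(tokenizer_name="bert-base-uncased"),  # illustrative tokenizer name
        pad_label="O",
        shuffle_n=5,
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=1, collate_fn=dataset.collate_fn)
    for batch in loader:
        print({name: tensor.shape for name, tensor in batch.items()})  # all tensors are [Batch, Time]
        break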

check_for_label_consistency_with_model_config(punct_label_ids: Optional[Dict[str, int]], capit_label_ids: Optional[Dict[str, int]], class_labels: omegaconf.DictConfig, common_dataset_parameters_config: omegaconf.DictConfig) None[source]#

Checks that label ids loaded from tarred dataset are identical to those provided in model.common_dataset_parameters config item. In addition, this method checks that label ids set in attributes punct_label_ids and capit_label_ids of an instance of PunctuationCapitalizationModel are identical to label ids loaded from tarred dataset.

Parameters
  • punct_label_ids – a content of punct_label_ids attribute of an instance of PunctuationCapitalizationModel in which this tarred dataset is used.

  • capit_label_ids – a content of capit_label_ids attribute of an instance of PunctuationCapitalizationModel in which this tarred dataset is used.

  • class_labels – a config item model.class_labels. See more in description of class labels config.

  • common_dataset_parameters_config – a config item model.common_dataset_parameters. See more in the description of the common dataset parameters config.

static collate_fn(batches: List[Dict[str, numpy.ndarray]]) Dict[str, torch.Tensor][source]#

Returns the zeroth batch of the batches list passed for collating and casts 'segment_ids', 'punct_labels', 'capit_labels' to types supported by PunctuationCapitalizationModel. All output tensors have shape [Batch, Time].

Warning

The batch_size parameter of the PyTorch data loader and sampler has to be 1.

Parameters

batches (List[Dict[str, np.ndarray]]) – a list of batches passed for collating

Returns

a batch dictionary with the items described in the __getitem__() method

Return type

Dict[str, torch.Tensor]

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Returns neural types of batches yielded by this dataset.

save_labels_and_get_file_paths(punct_labels_file_name: str, capit_labels_file_name: str) Tuple[pathlib.Path, pathlib.Path][source]#

Copies the label vocabulary files for punctuation and capitalization into the directory passed in the constructor parameter label_info_save_dir. The names of the new files are punct_labels_file_name and capit_labels_file_name.

The signature of this method must be identical to the signature of the method BertPunctuationCapitalizationDataset.save_labels_and_get_file_paths().

Parameters
  • punct_labels_file_name (str) – a name of punctuation labels file

  • capit_labels_file_name (str) – a name of capitalization labels file

Returns

a tuple of 2 elements

  • pathlib.Path: a path to the new punctuation label ids file

  • pathlib.Path: a path to the new capitalization label ids file

Return type

Tuple[Path, Path]

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset.BertPunctuationCapitalizationInferDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset

A dataset to use during inference for punctuation and capitalization tasks with a pretrained model. For datasets to use during training with labels, see BertPunctuationCapitalizationDataset and BertPunctuationCapitalizationTarredDataset.

Parameters max_seq_length, step, and margin control the way queries are split into segments which are then processed by the model. Parameter max_seq_length is the length of a segment after tokenization, including the special tokens [CLS] at the beginning and [SEP] at the end of a segment. Parameter step is the shift between consecutive segments. Parameter margin is used to exclude the negative effect of subtokens near segment borders, which have context on only one side.

Parameters
  • queries (List[str]) – list of sequences.

  • tokenizer (TokenizerSpec) – a tokenizer which was used for model training. It should have properties cls_id, sep_id, unk_id, pad_id.

  • max_seq_length (int, optional, defaults to 128) – max sequence length which includes [CLS] and [SEP] tokens

  • step (int, optional, defaults to 8) – relative shift of consecutive segments into which long queries are split. Long queries are split into segments which can overlap. Parameter step controls such overlapping. Imagine that queries are tokenized into characters, max_seq_length=5, and step=2. In such a case query “hello” is tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']].

  • margin (int, optional, defaults to 16) – number of subtokens at the beginning and the end of segments which are not used for prediction computation. The first segment does not have a left margin and the last segment does not have a right margin. For example, if the input sequence is tokenized into characters, max_seq_length=5, step=1, and margin=1, then query “hello” will be tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]. These segments are passed to the model. Before final predictions are computed, margins are removed. In the following list, subtokens whose logits are not used for final prediction computation are marked with an asterisk: [['[CLS]'*, 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]'*]].
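A minimal sketch of building an inference dataset from raw queries and collating segments into model-ready tensors; the tokenizer name is illustrative.

    import torch

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset import (
        BertPunctuationCapitalizationInferDataset,
    )
    from nemo.collections.nlp.modules.common import get_tokenizer

    dataset = BertPunctuationCapitalizationInferDataset(
        queries=["how are you", "great see you tomorrow"],            # text without punctuation or capitalization
        tokenizer=get_tokenizer(tokenizer_name="bert-base-uncased"),  # illustrative tokenizer name
        max_seq_length=64,
        step=8,
        margin=16,
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=16, collate_fn=dataset.collate_fn)

    # Each collated batch is the 8-element tuple described by collate_fn() below.
    input_ids, segment_ids, input_mask, subtokens_mask, preceding, query_ids, is_first, is_last = next(iter(loader))
    print(input_ids.shape)  # [Batch, Time]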

__getitem__(idx: int) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool][source]#

Returns batch used for punctuation and capitalization inference.

Parameters

idx (int) – a batch index

Returns

a tuple containing:

  • input_ids (np.ndarray): an integer numpy array of shape [Time]. Ids of word subtokens encoded using tokenizer passed in constructor tokenizer parameter.

  • segment_ids (np.ndarray): an integer zeros numpy array of shape [Time]. Indices of segments for BERT model (token types in HuggingFace terminology).

  • input_mask (np.ndarray): a boolean numpy array of shape [Time]. An element of this array is True if the corresponding token is not a padding token.

  • subtokens_mask (np.ndarray): a boolean numpy array of shape [Time]. An element equals True if the corresponding token is the first token in a word and False otherwise. For example, if input query "language processing" is tokenized into ["[CLS]", "language", "process", "ing", "[SEP]"], then subtokens_mask will be [False, True, True, False, False].

  • quantities_of_preceding_words (int): the number of words preceding the current segment in the query to which the segment belongs. This parameter is used for uniting predictions from adjacent segments.

  • query_ids (int): an index of the query to which the segment belongs

  • is_first (bool): whether a segment is the first segment in a query. The left margin of the first segment in a query is not removed.

  • is_last (bool): whether a segment is the last segment in a query. The right margin of the last segment in a query is not removed.

Return type

Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, int, int, bool, bool]

collate_fn(batch: List[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool]]) Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Tuple[int, ...], Tuple[int, ...], Tuple[bool, ...], Tuple[bool, ...]][source]#

Collates samples into batches.

Parameters

batch (List[tuple]) – a list of samples returned by __getitem__() method.

Returns

a tuple containing 8 elements:

  • input_ids (torch.Tensor): an integer tensor of shape [Batch, Time] containing encoded input text.

  • segment_ids (torch.Tensor): an integer tensor of shape [Batch, Time] filled with zeros.

  • input_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is not a padding token.

  • subtokens_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is the first token in a word.

  • quantities_of_preceding_words (Tuple[int, ...]): a tuple containing, for each segment, the number of words in the query preceding that segment.

  • query_ids (Tuple[int, ...]): a tuple containing indices of queries to which segments belong.

  • is_first (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the first segment in a query.

  • is_last (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the last segment in a query.

Return type

Tuple[torch.Tensor (x4), Tuple[int, ...] (x2), Tuple[bool, ...] (x2)]

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Returns neural types of collate_fn() output.