NeMo NLP collection API#

Model Classes#

Modules#

nemo.collections.nlp.modules.common.megatron.get_megatron_lm_models_list() List[str][source]#

Returns the list of supported Megatron-LM models
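
A minimal sketch of calling this helper, assuming the function is importable from the module path shown in its signature:

    from nemo.collections.nlp.modules.common.megatron import get_megatron_lm_models_list

    # Print the names of all supported Megatron-LM models.
    for model_name in get_megatron_lm_models_list():
        print(model_name)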

Datasets#

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset

A dataset to use during training for punctuation and capitalization tasks. For inference, you will need BertPunctuationCapitalizationInferDataset. For huge datasets which cannot be loaded into memory at once, use BertPunctuationCapitalizationTarredDataset.

Parameters
  • text_file (Union[str, os.PathLike]) – a path to a file with source sequences; each line should contain text without punctuation or capitalization.

  • labels_file (Union[str, os.PathLike]) – a path to a file with labels; each line corresponds to word labels for a sentence in the text_file. Labels have to follow the format described in the NeMo Data Format section of the documentation.

  • max_seq_length (int) – the maximum number of tokens in a source sequence. max_seq_length includes the [CLS] and [SEP] tokens. Sequences which are too long are clipped by removing tokens from the end of the sequence.

  • tokenizer (TokenizerSpec) – a tokenizer instance which has properties unk_id, sep_id, bos_id, eos_id.

  • num_samples (int, optional, defaults to -1) – the number of samples to use from the dataset. If -1, the whole dataset is used. Useful for testing.

  • tokens_in_batch (int, optional, defaults to 5000) – the number of tokens in a batch, including padding and special tokens ([CLS], [SEP], [UNK]). The __getitem__() method of this class returns ready batches rather than individual samples. The number of samples in a batch is adjusted to the lengths of the input sequences: if input sequences are short, a batch will contain more samples. Before packing into batches, samples are sorted by the number of tokens they contain. Sorting significantly reduces the number of pad tokens in a batch. Regular PyTorch data loader shuffling only permutes batches without changing their content. Proper shuffling is achieved by calling the method repack_batches_with_shuffle() every epoch. If the parameter number_of_batches_is_multiple_of is greater than 1, some batches may be split into smaller pieces.

  • pad_label (str, optional, defaults to 'O') – pad value to use for labels. It’s also the neutral label both for punctuation and capitalization.

  • punct_label_ids (Dict[str, int], optional) – a dict mapping punctuation labels to label ids. For the dev set, use the label ids generated during training to support cases when not all labels are present in the dev set. For training, it is recommended to set punct_label_ids to None or load it from cache.

  • capit_label_ids (Dict[str, int], optional) – same as punct_label_ids, but for capitalization labels.

  • ignore_extra_tokens (bool, optional, defaults to False) – whether to exclude tokens which are not the first tokens in a word from loss computation. For example, assume the word 'tokenization' is tokenized into ['token', 'ization']. If ignore_extra_tokens=True, the loss mask for the word is [True, False]; if ignore_extra_tokens=False, the loss mask is [True, True].

  • ignore_start_end (bool, optional, defaults to True) – whether to ignore [CLS] and [SEP] tokens in the loss_mask.

  • use_cache (bool, optional, defaults to True) – whether to use pickled features already present in cache_dir. If the pickled features file does not exist or use_cache=False, then features are pickled in cache_dir. Pickled features include input ids, a subtokens mask (mask of first tokens in words), encoded punctuation and capitalization labels, and label ids. Feature creation takes considerable time, so use_cache=True significantly speeds up training start-up. Pickled features are also used for sharing features between processes if data parallel training is used.

  • cache_dir (Union[str, os.PathLike], optional) – a path to a directory where the cache (pickled features) is stored. By default, the text_file parent directory is used. This parameter is useful if the dataset directory is read-only and you wish to pickle features. In such a case, pass a path to a writable directory in the cache_dir parameter.

  • get_label_frequencies (bool, optional, defaults to False) – whether to print and save label frequencies. Frequencies are shown if the verbose parameter is True. If get_label_frequencies=True, then frequencies are saved into the label_info_save_dir directory.

  • label_info_save_dir (Union[str, os.PathLike], optional) – a path to a directory where label frequencies are saved. By default, the text_file parent directory is used. When the method save_labels_and_get_file_paths() is called, label ids are saved into the label_info_save_dir directory. This parameter is useful if the directory containing text_file is read-only.

  • punct_label_vocab_file (Union[str, os.PathLike], optional) – a path to a .csv file containing the punctuation label vocabulary. Each line in such a vocabulary file contains exactly one label. The first line has to contain pad_label, otherwise an error is raised.

  • capit_label_vocab_file (Union[str, os.PathLike], optional) – same as punct_label_vocab_file for capitalization labels.

  • add_masks_and_segment_ids_to_batch (bool, optional, defaults to True) – whether to add 'loss_mask', 'input_mask', and 'segment_ids' items to a batch. Setting this to False is useful for creation of a tarred dataset and can NOT be used during model training and inference.

  • verbose (bool, optional, defaults to True) – whether to show data examples, label stats and other useful information.

  • n_jobs (int, optional, defaults to 0) –

    the number of workers used for tokenization, encoding labels, creating the "first token in word" mask, and clipping. If n_jobs <= 0, data preparation is performed without multiprocessing.

    Warning

    There can be deadlocking problems with some tokenizers (e.g., SentencePiece, HuggingFace ALBERT) if n_jobs > 0.

  • number_of_batches_is_multiple_of (int, optional, defaults to 1) – the number of batches in the dataset is made divisible by number_of_batches_is_multiple_of. If number_of_batches_is_multiple_of is greater than 1, then several batches are split in parts until the number of batches is divisible by number_of_batches_is_multiple_of. If there are not enough queries in the dataset to create enough batches, a warning is printed. This parameter is useful for dev and validation datasets if multiple GPUs are used: if the number of batches is not evenly divisible by the number of GPUs, some queries may be processed several times and metrics will be distorted.

  • batch_shuffling_random_seed (int, optional) – a random seed used for batch repacking and shuffling.

  • tokenization_progress_queue (multiprocessing.Queue, optional) – a queue for reporting tokenization progress. Useful for creation of a tarred dataset.

  • batch_mark_up_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in deciding which samples go into which batches. Useful for creation of a tarred dataset.

  • batch_building_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in batch creation (stacking and padding). Useful for creation of a tarred dataset.

  • use_audio (bool, optional, defaults to False) – if True, the dataset returns audio as well as text.

  • audio_file (Union[str, os.PathLike], optional) – a path to a file with audio paths.

  • sample_rate (int, optional, defaults to None) – the sample rate of audios. Can be used for upsampling or downsampling of audio.

  • use_bucketing (bool, optional, defaults to True) – if False, the dataset returns batches of batch_size samples instead of batches with number_of_tokens tokens.

  • preload_audios (bool, optional, defaults to True) – if True, batches include waveforms; if False, batches store audio_filepaths instead, and audios are loaded during the collate_fn call.
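
A minimal sketch of constructing the dataset. The file paths are hypothetical, and the tokenizer setup via get_tokenizer() is an assumption; any TokenizerSpec instance works:

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset import (
        BertPunctuationCapitalizationDataset,
    )
    from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer

    tokenizer = get_tokenizer(tokenizer_name="bert-base-uncased")  # assumption: any TokenizerSpec works

    dataset = BertPunctuationCapitalizationDataset(
        text_file="train/text_train.txt",      # hypothetical path
        labels_file="train/labels_train.txt",  # hypothetical path
        max_seq_length=128,
        tokenizer=tokenizer,
        tokens_in_batch=5000,  # __getitem__() returns ready batches sized by token count
        pad_label="O",
    )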

__getitem__(idx: int) Dict[str, numpy.ndarray][source]#

Return a batch with index idx. The values of a batch dictionary are numpy arrays of identical shapes [Batch, Time]. Labels are identical for all tokens in a word. For example, if

  • word 'Tokenization' is tokenized into tokens ['token', 'ization'],

  • it is followed by a comma,

then punctuation labels are [',', ','] and capitalization labels are ['U', 'U'] ('U' is the label for words which start with an upper-case character).

Parameters

idx – an index of returned batch

Returns

a dictionary with items:

  • 'input_ids' (numpy.ndarray): a numpy.int32 array containing encoded tokens,

  • 'subtokens_mask' (numpy.ndarray): a boolean array whose elements are True if they correspond to the first token in a word,

  • 'punct_labels' (numpy.ndarray): a numpy.int32 array containing encoded punctuation labels,

  • 'capit_labels' (numpy.ndarray): a numpy.int32 array containing encoded capitalization labels,

  • 'segment_ids' (numpy.ndarray): a numpy.int8 array filled with zeros (BERT token types in HuggingFace terminology). If self.add_masks_and_segment_ids_to_batch is False, this item is missing,

  • 'input_mask' (numpy.ndarray): a boolean array whose elements are True if the corresponding token is not a padding token. If self.add_masks_and_segment_ids_to_batch is False, this item is missing,

  • 'loss_mask' (numpy.ndarray): a boolean array whose elements are True if loss is computed for the corresponding token. See more in the description of the constructor parameters ignore_start_end and ignore_extra_tokens. If self.add_masks_and_segment_ids_to_batch is False, this item is missing,

  • 'features' (numpy.ndarray): a np.float array of audio waveforms if self.preload_audios is set to True, else empty,

  • 'features_length' (numpy.ndarray): a np.long array with the number of samples per audio,

  • 'audio_filepaths' (List[str]): paths of audio files if self.preload_audios is set to False.

Return type

Dict[str, np.ndarray]
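
For illustration, a short sketch of inspecting one ready-made batch, assuming the dataset instance from the construction example above:

    # Each index yields a whole batch, not a single sample.
    batch = dataset[0]
    # Token ids and labels share the shape [Batch, Time].
    assert batch["input_ids"].ndim == 2
    assert batch["input_ids"].shape == batch["punct_labels"].shape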

static calc_batch_seq_length(queries: List[numpy.ndarray], length_is_multiple_of: int) int[source]#
collate_fn(batches: List[Dict[str, numpy.ndarray]]) Dict[str, torch.Tensor][source]#

If self.use_bucketing is set to True, returns the zeroth batch of the batches list passed for collating and casts 'segment_ids', 'punct_labels', and 'capit_labels' to types supported by PunctuationCapitalizationModel (or PunctuationCapitalizationLexicalAudioModel if self.use_audio is set to True). All output tensors have shape [Batch, Time].

Warning

The batch_size parameter of a PyTorch data loader and sampler has to be 1 if self.use_bucketing is set to True.

Parameters

batches (List[Dict[str, np.ndarray]]) – a list containing 1 batch passed for collating

Returns

a batch dictionary with the items described in the __getitem__() method.

Return type

Dict[str, torch.Tensor]
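
A sketch of wiring the dataset into a PyTorch data loader under the bucketing constraint described above; batch_size=1 is required because __getitem__() already returns whole batches:

    import torch

    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=1,                  # required when self.use_bucketing is True
        shuffle=False,                 # shuffling is done via repack_batches_with_shuffle()
        collate_fn=dataset.collate_fn,
    )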

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Returns definitions of module output ports.

repack_batches_with_shuffle() None[source]#

A function for proper shuffling of the dataset. PyTorch data loader shuffling only permutes whole batches without changing their content.
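
A sketch of the intended per-epoch usage; the training loop skeleton and the loader from the previous sketch are assumptions:

    for epoch in range(3):  # hypothetical number of epochs
        # Redistribute samples across batches and shuffle before every epoch.
        dataset.repack_batches_with_shuffle()
        for batch in loader:
            ...  # training step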

save_labels_and_get_file_paths(punct_labels_file_name: str, capit_labels_file_name: str) Tuple[pathlib.Path, pathlib.Path][source]#

Saves label ids into files located in self.label_info_save_dir. Saved label ids are usually used for .nemo checkpoint creation.

The signature of this method must be identical to the signature of the method BertPunctuationCapitalizationTarredDataset.save_labels_and_get_file_paths().

Parameters
  • punct_labels_file_name (str) – a name of a punctuation labels file

  • capit_labels_file_name (str) – a name of a capitalization labels file

Returns

a tuple containing:

  • pathlib.Path: a path to the saved punctuation labels file

  • pathlib.Path: a path to the saved capitalization labels file

Return type

Tuple[pathlib.Path, pathlib.Path]
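
A sketch of persisting the label vocabularies; the file names are illustrative:

    punct_path, capit_path = dataset.save_labels_and_get_file_paths(
        punct_labels_file_name="punct_label_ids.csv",  # illustrative name
        capit_labels_file_name="capit_label_ids.csv",  # illustrative name
    )
    print(punct_path, capit_path)  # paths inside self.label_info_save_dir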

class nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset.BertPunctuationCapitalizationInferDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset

A dataset to use during inference for punctuation and capitalization tasks with a pretrained model. For datasets to use during training with labels, see BertPunctuationCapitalizationDataset and BertPunctuationCapitalizationTarredDataset.

The parameters max_seq_length, step, and margin control the way queries are split into segments which are then processed by the model. The parameter max_seq_length is the length of a segment after tokenization, including the special tokens [CLS] at the beginning and [SEP] at the end of a segment. The parameter step is the shift between consecutive segments. The parameter margin is used to exclude the negative effect of subtokens near segment borders, which have context on only one side.

Parameters
  • queries (List[str]) – list of sequences.

  • tokenizer (TokenizerSpec) – a tokenizer which was used for model training. It should have properties cls_id, sep_id, unk_id, pad_id.

  • max_seq_length (int, optional, defaults to 128) – max sequence length which includes [CLS] and [SEP] tokens

  • step (int, optional, defaults to 8) – the shift between consecutive segments into which long queries are split. Long queries are split into segments which can overlap; the step parameter controls this overlap. Imagine that queries are tokenized into characters, max_seq_length=5, and step=2. In such a case, the query "hello" is tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']].

  • margin (int, optional, defaults to 16) – the number of subtokens at the beginning and the end of segments which are not used for prediction computation. The first segment does not have a left margin, and the last segment does not have a right margin. For example, if an input sequence is tokenized into characters, max_seq_length=5, step=1, and margin=1, then the query "hello" will be tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]. These segments are passed to the model. Before final predictions are computed, the margins are removed. In the following list, subtokens whose logits are not used for final prediction computation are marked with an asterisk: [['[CLS]'*, 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]'*]].
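
A minimal sketch of constructing the inference dataset; the queries are illustrative, and the tokenizer is assumed to be the same TokenizerSpec instance that was used for training:

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset import (
        BertPunctuationCapitalizationInferDataset,
    )

    infer_dataset = BertPunctuationCapitalizationInferDataset(
        queries=["hello how are you", "nemo supports punctuation restoration"],  # illustrative
        tokenizer=tokenizer,  # assumption: the tokenizer used for training
        max_seq_length=128,
        step=8,
        margin=16,
    )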

__getitem__(idx: int) Union[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool, numpy.ndarray, List[int]]][source]#

Returns a sample (one segment of a query) used for punctuation and capitalization inference.

Parameters

idx (int) – a sample index

Returns

a tuple containing:

  • input_ids (np.ndarray): an integer numpy array of shape [Time]. Ids of word subtokens encoded using the tokenizer passed in the constructor's tokenizer parameter.

  • segment_ids (np.ndarray): an integer numpy array of zeros of shape [Time]. Indices of segments for a BERT model (token types in HuggingFace terminology).

  • input_mask (np.ndarray): a boolean numpy array of shape [Time]. An element of this array is True if the corresponding token is not a padding token.

  • subtokens_mask (np.ndarray): a boolean numpy array of shape [Time]. An element equals True if the corresponding token is the first token in a word and False otherwise. For example, if the input query "language processing" is tokenized into ["[CLS]", "language", "process", "ing", "[SEP]"], then subtokens_mask will be [False, True, True, False, False].

  • quantities_of_preceding_words (int): the number of words preceding the current segment in the query to which the segment belongs. This parameter is used for uniting predictions from adjacent segments.

  • query_ids (int): an index of the query to which the segment belongs.

  • is_first (bool): whether the segment is the first segment in its query. The left margin of the first segment in a query is not removed.

  • is_last (bool): whether the segment is the last segment in its query. The right margin of the last segment in a query is not removed.

Return type

Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, int, int, bool, bool]

collate_fn(batch: List[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool, Optional[numpy.ndarray], Optional[numpy.ndarray]]]) Union[Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Any, Any, Any, Any], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Any, Any, Any, Any, Any, Any]][source]#

Collates samples into batches.

Parameters

batch (List[tuple]) – a list of samples returned by the __getitem__() method.

Returns

a tuple containing 8 elements:

  • input_ids (torch.Tensor): an integer tensor of shape [Batch, Time] containing encoded input text.

  • segment_ids (torch.Tensor): an integer tensor of shape [Batch, Time] filled with zeros.

  • input_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is not a padding token.

  • subtokens_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is the first token in a word.

  • quantities_of_preceding_words (Tuple[int, ...]): a tuple containing, for each segment, the number of words preceding it in its query.

  • query_ids (Tuple[int, ...]): a tuple containing indices of the queries to which the segments belong.

  • is_first (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the first segment in a query.

  • is_last (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the last segment in a query.

Return type

Tuple[torch.Tensor (x4), Tuple[int, ...] (x2), Tuple[bool, ...] (x2)]
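
A sketch of batching segments for inference. Unlike the bucketed training dataset, this collate_fn stacks individual segments, so a regular batch_size is used; the value here is hypothetical:

    import torch

    infer_loader = torch.utils.data.DataLoader(
        infer_dataset,
        batch_size=32,  # hypothetical batch size
        collate_fn=infer_dataset.collate_fn,
    )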

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Returns neural types of collate_fn() output.