NeMo NLP collection API#
Model Classes#
Modules#
Datasets#
- class nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset(*args: Any, **kwargs: Any)[source]#
Bases: nemo.core.classes.dataset.Dataset
A dataset to use during training for punctuation and capitalization tasks. For inference, you will need BertPunctuationCapitalizationInferDataset. For huge datasets which cannot be loaded into memory simultaneously, use BertPunctuationCapitalizationTarredDataset.

- Parameters
  - text_file (Union[str, os.PathLike]) – a path to a file with sequences; each line should contain a text without punctuation and capitalization.
  - labels_file (Union[str, os.PathLike]) – a path to a file with labels; each line corresponds to the word labels for a sentence in text_file. Labels have to follow the format described in the NeMo Data Format section of the documentation.
  - max_seq_length (int) – the maximum number of tokens in a source sequence. max_seq_length includes the [CLS] and [SEP] tokens. Sequences which are too long are clipped by removing tokens from the end of the sequence.
  - tokenizer (TokenizerSpec) – a tokenizer instance which has properties unk_id, sep_id, bos_id, eos_id.
  - num_samples (int, optional, defaults to -1) – the number of samples to use from the dataset. If -1, the whole dataset is used. Useful for testing.
  - tokens_in_batch (int, optional, defaults to 5000) – the number of tokens in a batch, including padding and special tokens ([CLS], [SEP], [UNK]). This class's __getitem__() method returns ready batches rather than individual samples. The number of samples in a batch is adjusted to the input sequence lengths: if input sequences are short, a batch contains more samples. Before packing into batches, samples are sorted by the number of tokens they contain; sorting significantly reduces the number of pad tokens in a batch. Regular PyTorch data loader shuffling will only permute batches without changing their content, so proper shuffling is achieved by calling the method repack_batches_with_shuffle() every epoch. If the parameter number_of_batches_is_multiple_of is greater than 1, some batches may be split into smaller pieces.
  - pad_label (str, optional, defaults to 'O') – the pad value to use for labels. It is also the neutral label for both punctuation and capitalization.
  - punct_label_ids (Dict[str, int], optional) – a dict mapping punctuation labels to label ids. For a dev set, use the label ids generated during training to support cases when not all labels are present in the dev set. For training, it is recommended to set punct_label_ids to None or to load it from cache.
  - capit_label_ids (Dict[str, int], optional) – same as punct_label_ids but for capitalization labels.
  - ignore_extra_tokens (bool, optional, defaults to False) – whether to compute loss on tokens which are not the first tokens in a word. For example, assume that the word 'tokenization' is tokenized into ['token', 'ization']. If ignore_extra_tokens=True, the loss mask for the word is [True, False]; if ignore_extra_tokens=False, the loss mask is [True, True].
  - ignore_start_end (bool, optional, defaults to True) – whether to ignore [CLS] and [SEP] tokens in the loss mask.
  - use_cache (bool, optional, defaults to True) – whether to use pickled features already present in cache_dir. If the pickled features file does not exist or use_cache=False, then features are pickled in cache_dir. Pickled features include input ids, the subtokens mask (mask of first tokens in words), encoded punctuation and capitalization labels, and label ids. Feature creation consumes considerable time, so use_cache=True significantly speeds up the start of training. Pickled features are also used for sharing features between processes if data parallel training is used.
  - cache_dir (Union[str, os.PathLike], optional) – a path to a directory where the cache (pickled features) is stored. By default, the text_file parent directory is used. This parameter is useful if the dataset directory is read-only and you wish to pickle features; in such a case, pass a writable directory in the cache_dir parameter.
  - get_label_frequencies (bool, optional, defaults to False) – whether to print and save label frequencies. Frequencies are shown if the verbose parameter is True. If get_label_frequencies=True, frequencies are saved into the label_info_save_dir directory.
  - label_info_save_dir (Union[str, os.PathLike], optional) – a path to a directory where label frequencies are saved. By default, the text_file parent directory is used. When the method save_labels_and_get_file_paths() is called, label ids are saved into the label_info_save_dir directory. This parameter is useful if the directory containing text_file is read-only.
  - punct_label_vocab_file (Union[str, os.PathLike], optional) – a path to a .csv file containing the punctuation label vocabulary. Each line in the vocabulary file contains exactly one label. The first line has to contain pad_label, otherwise an error is raised.
  - capit_label_vocab_file (Union[str, os.PathLike], optional) – same as punct_label_vocab_file but for capitalization labels.
  - add_masks_and_segment_ids_to_batch (bool, optional, defaults to True) – whether to add 'loss_mask', 'input_mask', and 'segment_ids' items to a batch. Useful for creation of tarred datasets and can NOT be used during model training and inference.
  - verbose (bool, optional, defaults to True) – whether to show data examples, label stats, and other useful information.
  - n_jobs (int, optional, defaults to 0) – the number of workers used for tokenization, encoding labels, creating the "first token in word" mask, and clipping. If n_jobs <= 0, data preparation is performed without multiprocessing.

    Warning: there can be deadlocking problems with some tokenizers (e.g. SentencePiece, HuggingFace ALBERT) if n_jobs > 0.

  - number_of_batches_is_multiple_of (int, optional, defaults to 1) – the number of batches in the dataset is made divisible by number_of_batches_is_multiple_of. If number_of_batches_is_multiple_of is greater than 1, then several batches are split in parts until the number of batches is divisible by number_of_batches_is_multiple_of. If there are not enough queries in the dataset to create enough batches, a warning is printed. This parameter is useful for dev and validation datasets if multiple GPUs are used: if the number of batches is not evenly divisible by the number of GPUs, then some queries may be processed several times and metrics will be distorted.
  - batch_shuffling_random_seed (int) – a random seed used for batch repacking and shuffling.
  - tokenization_progress_queue (multiprocessing.Queue, optional) – a queue for reporting tokenization progress. Useful for creation of tarred datasets.
  - batch_mark_up_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in deciding which samples batches will contain. Useful for creation of tarred datasets.
  - batch_building_progress_queue (multiprocessing.Queue, optional) – a queue for reporting progress in batch creation (stacking and padding). Useful for creation of tarred datasets.
  - use_audio (bool, optional, defaults to False) – if set to True, the dataset returns audio as well as text.
  - audio_file (Union[str, os.PathLike], optional) – a path to a file with audio paths.
  - sample_rate (int, optional, defaults to None) – the sample rate of audios. Can be used for upsampling or downsampling of audio.
  - use_bucketing (bool, optional, defaults to True) – if set to False, the dataset returns batches of batch_size samples instead of batches of number_of_tokens tokens.
  - preload_audios (bool, optional, defaults to True) – if set to True, batches include waveforms; if set to False, batches store audio_filepaths instead and audios are loaded during the collate_fn call.
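For illustration, a minimal construction sketch follows. The file paths are placeholders, and the choice of bert-base-uncased via NeMo's get_tokenizer helper is an assumption, not a requirement:

    # A sketch, not a definitive recipe: "train_text.txt" and "train_labels.txt"
    # are placeholder paths, and "bert-base-uncased" is an assumed tokenizer choice.
    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset import (
        BertPunctuationCapitalizationDataset,
    )
    from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer

    tokenizer = get_tokenizer(tokenizer_name="bert-base-uncased")
    dataset = BertPunctuationCapitalizationDataset(
        text_file="train_text.txt",      # one unpunctuated, lower-cased sentence per line
        labels_file="train_labels.txt",  # word-level punctuation/capitalization labels
        max_seq_length=128,
        tokenizer=tokenizer,
        tokens_in_batch=5000,            # __getitem__() returns ready batches, not samples
        pad_label="O",
    )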
- __getitem__(idx: int) Dict[str, numpy.ndarray] [source]#
Return a batch with index idx. The values of the batch dictionary are numpy arrays of identical shape [Batch, Time]. Labels are identical for all tokens in a word. For example, if the word 'Tokenization' is tokenized into tokens ['token', 'ization'] and is followed by a comma, then the punctuation labels are [',', ','] and the capitalization labels are ['U', 'U'] ('U' is the label for words which start with an upper case character).
- Parameters
  idx – an index of the returned batch
- Returns
  a dictionary with items:
  - 'input_ids' (numpy.ndarray): numpy.int32 array containing encoded tokens,
  - 'subtokens_mask' (numpy.ndarray): bool array whose elements are True if they correspond to the first token in a word,
  - 'punct_labels' (numpy.ndarray): numpy.int32 array containing encoded punctuation labels,
  - 'capit_labels' (numpy.ndarray): numpy.int32 array containing encoded capitalization labels,
  - 'segment_ids' (numpy.ndarray): numpy.int8 array filled with zeros (BERT token types in HuggingFace terminology); this item is missing if self.add_masks_and_segment_ids_to_batch is False,
  - 'input_mask' (numpy.ndarray): bool array whose elements are True if the corresponding token is not a padding token; this item is missing if self.add_masks_and_segment_ids_to_batch is False,
  - 'loss_mask' (numpy.ndarray): bool array whose elements are True if loss is computed for the corresponding token (see the description of the constructor parameters ignore_start_end and ignore_extra_tokens); this item is missing if self.add_masks_and_segment_ids_to_batch is False,
  - 'features' (numpy.ndarray): np.float array of audio waveforms if self.preload_audios is set to True, else empty,
  - 'features_length' (numpy.ndarray): np.long array of the number of samples per audio,
  - 'audio_filepaths' (List[str]): paths of the audio files if self.preload_audios is set to False.
- Return type
  Dict[str, np.ndarray]
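As a quick check, the batch structure can be inspected by indexing the dataset (a sketch, assuming the dataset constructed in the example above):

    batch = dataset[0]  # a ready batch, not a single sample
    for name, value in batch.items():
        # numpy arrays have .shape; 'audio_filepaths' is a plain list
        print(name, value.shape if hasattr(value, "shape") else len(value))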
- static calc_batch_seq_length(queries: List[numpy.ndarray], length_is_multiple_of: int) int [source]#
- collate_fn(batches: List[Dict[str, numpy.ndarray]]) Dict[str, torch.Tensor] [source]#
If self.use_bucketing is set to True, returns the zeroth batch from the batches list passed for collating and casts 'segment_ids', 'punct_labels', and 'capit_labels' to types supported by PunctuationCapitalizationModel (or PunctuationCapitalizationLexicalAudioModel if self.use_audio is set to True). All output tensors have shape [Batch, Time].

Warning: the batch_size parameter of a PyTorch data loader and sampler has to be 1 if self.use_bucketing is set to True.

- Parameters
  batches (List[Dict[str, np.ndarray]]) – a list containing 1 batch passed for collating
- Returns
  a batch dictionary with the following items (for a detailed description of batch items see the __getitem__() method):
  - 'input_ids' (torch.Tensor): torch.int32 tensor,
  - 'subtokens_mask' (torch.Tensor): torch.bool tensor,
  - 'punct_labels' (torch.Tensor): torch.int64 tensor,
  - 'capit_labels' (torch.Tensor): torch.int64 tensor,
  - 'segment_ids' (torch.Tensor): torch.int32 tensor,
  - 'input_mask' (torch.Tensor): torch.bool tensor,
  - 'loss_mask' (torch.Tensor): torch.bool tensor,
  - 'features' (torch.Tensor): torch.float tensor,
  - 'features_length' (torch.Tensor): torch.long tensor.
- Return type
Dict[str, torch.Tensor]
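Because batches are pre-assembled by the dataset when bucketing is used, a PyTorch data loader should be given batch_size=1 together with this collate function. A minimal sketch, continuing the construction example above:

    import torch

    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=1,                   # must be 1 when self.use_bucketing is True
        shuffle=False,                  # shuffling is done by repack_batches_with_shuffle()
        collate_fn=dataset.collate_fn,  # unwraps the single pre-built batch and casts dtypes
    )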
- property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#
Returns definitions of module output ports.
- repack_batches_with_shuffle() None [source]#
A method for proper shuffling of the dataset: PyTorch data loader shuffling will only permute whole batches without changing their content.
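A sketch of per-epoch usage in a training loop (num_epochs and the loop body are placeholders):

    num_epochs = 10  # placeholder
    for epoch in range(num_epochs):
        dataset.repack_batches_with_shuffle()  # repacks samples into new batches and shuffles them
        for batch in loader:
            ...  # forward/backward pass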
- save_labels_and_get_file_paths(punct_labels_file_name: str, capit_labels_file_name: str) Tuple[pathlib.Path, pathlib.Path] [source]#
Saves label ids into files located in self.label_info_save_dir. Saved label ids are usually used for .nemo checkpoint creation. The signature of this method must be identical to the signature of the corresponding save_labels_and_get_file_paths() method of BertPunctuationCapitalizationTarredDataset.
- Parameters
  - punct_labels_file_name (str) – a name for the punctuation labels file
  - capit_labels_file_name (str) – a name for the capitalization labels file
- Returns
  a tuple containing:
  - pathlib.Path: a path to the saved punctuation labels file
  - pathlib.Path: a path to the saved capitalization labels file
- Return type
  Tuple[pathlib.Path, pathlib.Path]
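For example, the label vocabularies could be written out before .nemo checkpoint creation (the file names below are arbitrary placeholders):

    punct_path, capit_path = dataset.save_labels_and_get_file_paths(
        punct_labels_file_name="punct_label_ids.csv",
        capit_labels_file_name="capit_label_ids.csv",
    )
    print(punct_path, capit_path)  # files are created under self.label_info_save_dir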
- class nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset.BertPunctuationCapitalizationInferDataset(*args: Any, **kwargs: Any)[source]#
Bases: nemo.core.classes.dataset.Dataset
Creates a dataset to use during inference for punctuation and capitalization tasks with a pretrained model. For datasets to use during training with labels, see BertPunctuationCapitalizationDataset and BertPunctuationCapitalizationTarredDataset.

Parameters max_seq_length, step, and margin control the way queries are split into segments which are then processed by the model. Parameter max_seq_length is the length of a segment after tokenization, including the special tokens [CLS] at the beginning and [SEP] at the end of a segment. Parameter step is the shift between consequent segments. Parameter margin is used to exclude the negative effect of subtokens near segment borders, which have context on only one side.

- Parameters
  - queries (List[str]) – a list of sequences.
  - tokenizer (TokenizerSpec) – a tokenizer which was used for model training. It should have properties cls_id, sep_id, unk_id, pad_id.
  - max_seq_length (int, optional, defaults to 128) – the maximum sequence length, which includes the [CLS] and [SEP] tokens.
  - step (int, optional, defaults to 8) – the relative shift of consequent segments into which long queries are split. Long queries are split into segments which can overlap, and the parameter step controls such overlapping. Imagine that queries are tokenized into characters, max_seq_length=5, and step=2. In such a case the query "hello" is tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']].
  - margin (int, optional, defaults to 16) – the number of subtokens at the beginning and the end of segments which are not used for prediction computation. The first segment does not have a left margin and the last segment does not have a right margin. For example, if the input sequence is tokenized into characters, max_seq_length=5, step=1, and margin=1, then the query "hello" is tokenized into segments [['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]. These segments are passed to the model. Before final predictions are computed, margins are removed. In the following list, subtokens whose logits are not used for final prediction computation are marked with an asterisk: [['[CLS]'*, 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]'*]].
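A minimal construction sketch (the queries are placeholders, and the tokenizer is assumed to be the one used for model training, e.g. the tokenizer from the training example above):

    from nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset import (
        BertPunctuationCapitalizationInferDataset,
    )

    infer_dataset = BertPunctuationCapitalizationInferDataset(
        queries=["hello how are you", "nemo is a toolkit for conversational ai"],
        tokenizer=tokenizer,  # must match the tokenizer used for model training
        max_seq_length=128,
        step=8,
        margin=16,
    )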
- __getitem__(idx: int) Union[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool, numpy.ndarray, List[int]]] [source]#
Returns a sample used for punctuation and capitalization inference.
- Parameters
  idx (int) – a sample index
- Returns
  a tuple containing:
  - input_ids (np.ndarray): an integer numpy array of shape [Time]. Ids of word subtokens encoded using the tokenizer passed in the constructor's tokenizer parameter.
  - segment_ids (np.ndarray): an integer numpy array of zeros of shape [Time]. Indices of segments for the BERT model (token types in HuggingFace terminology).
  - input_mask (np.ndarray): a boolean numpy array of shape [Time]. An element of this array is True if the corresponding token is not a padding token.
  - subtokens_mask (np.ndarray): a boolean numpy array of shape [Time]. An element equals True if the corresponding token is the first token in a word and False otherwise. For example, if the input query "language processing" is tokenized into ["[CLS]", "language", "process", "ing", "SEP"], then subtokens_mask is [False, True, True, False, False].
  - quantities_of_preceding_words (int): the number of words preceding the current segment in the query to which the segment belongs. This parameter is used for uniting predictions from adjacent segments.
  - query_ids (int): the index of the query to which the segment belongs.
  - is_first (bool): whether the segment is the first segment in its query. The left margin of the first segment in a query is not removed.
  - is_last (bool): whether the segment is the last segment in its query. The right margin of the last segment in a query is not removed.
- Return type
  Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, int, int, bool, bool]
- collate_fn(batch: List[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, int, int, bool, bool, Optional[numpy.ndarray], Optional[numpy.ndarray]]]) Union[Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Any, Any, Any, Any], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Any, Any, Any, Any, Any, Any]] [source]#
Collates samples into batches.
- Parameters
  batch (List[tuple]) – a list of samples returned by the __getitem__() method.
- Returns
  a tuple containing 8 elements:
  - input_ids (torch.Tensor): an integer tensor of shape [Batch, Time] containing encoded input text.
  - segment_ids (torch.Tensor): an integer tensor of shape [Batch, Time] filled with zeros.
  - input_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is not a padding token.
  - subtokens_mask (torch.Tensor): a boolean tensor of shape [Batch, Time] whose elements are True if the corresponding token is the first token in a word.
  - quantities_of_preceding_words (Tuple[int, ...]): a tuple containing the number of words in a query preceding each segment.
  - query_ids (Tuple[int, ...]): a tuple containing indices of the queries to which segments belong.
  - is_first (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the first segment in a query.
  - is_last (Tuple[bool, ...]): a tuple of booleans whose elements are True if the corresponding segment is the last segment in a query.
- Return type
  Tuple[torch.Tensor (x4), Tuple[int, ...] (x2), Tuple[bool, ...] (x2)]
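Unlike the training dataset above, this collate function stacks individual samples, so any batch size may be used. A sketch, continuing the inference example:

    import torch

    infer_loader = torch.utils.data.DataLoader(
        infer_dataset,
        batch_size=8,                         # any batch size works for inference
        collate_fn=infer_dataset.collate_fn,  # pads and stacks samples into tensors
    )
    for input_ids, segment_ids, input_mask, subtokens_mask, *rest in infer_loader:
        ...  # rest: quantities_of_preceding_words, query_ids, is_first, is_last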
- property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#
Returns neural types of collate_fn() output.