Tokenizers

Wrapper around the HuggingFace AutoTokenizer: https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer.

Parameters:

pretrained_model_name – corresponds to the ‘pretrained_model_name_or_path’ input argument of the HuggingFace AutoTokenizer. For details, see the documentation of the from_pretrained method: https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer. The list of all supported models is available at https://huggingface.co/models
vocab_file – path to a vocabulary file consisting of characters separated by newlines.
mask_token – mask token
bos_token – the beginning of sequence token
eos_token – the end of sequence token. Usually equal to sep_token
pad_token – token to use for padding
sep_token – token used for separating sequences
cls_token – class token. Usually equal to bos_token
unk_token – token to use for unknown tokens
additional_special_tokens – list of other tokens besides the standard special tokens (bos, eos, pad, etc.). For example, the sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)
use_fast – whether to use the fast HuggingFace tokenizer
include_special_tokens – when True, converting text to ids will include special tokens / prompt tokens (if any), yielding self.tokenizer(text).input_ids
chat_template – the chat template string used to format “messages” through the apply_chat_template function of the underlying HF tokenizer
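The effect of vocab_file and include_special_tokens can be sketched with a toy whitespace tokenizer. This is a stand-in for illustration only: ToyTokenizer, load_vocab, and the five-entry vocabulary are hypothetical and are not part of NeMo or HuggingFace. It assumes the special tokens wrap the sequence as bos at the start and eos at the end; a real model may add different tokens.

```python
import os
import tempfile


def load_vocab(vocab_file):
    """Read a newline-separated vocabulary file into a token -> id mapping."""
    with open(vocab_file, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f if line.strip()]
    return {tok: idx for idx, tok in enumerate(tokens)}


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; NOT the NeMo AutoTokenizer wrapper."""

    def __init__(self, vocab_file, bos_token="<s>", eos_token="</s>",
                 unk_token="<unk>", include_special_tokens=False):
        self.vocab = load_vocab(vocab_file)
        self.bos_id = self.vocab[bos_token]
        self.eos_id = self.vocab[eos_token]
        self.unk_id = self.vocab[unk_token]
        self.include_special_tokens = include_special_tokens

    def text_to_ids(self, text):
        # Unknown words fall back to the unk_token id.
        ids = [self.vocab.get(tok, self.unk_id) for tok in text.split()]
        if self.include_special_tokens:
            # Assumption: special tokens are bos at the start, eos at the end.
            ids = [self.bos_id] + ids + [self.eos_id]
        return ids


# Build a throwaway vocabulary file with one entry per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("\n".join(["<s>", "</s>", "<unk>", "hello", "world"]) + "\n")
    vocab_path = tmp.name

plain = ToyTokenizer(vocab_path).text_to_ids("hello there world")
wrapped = ToyTokenizer(vocab_path, include_special_tokens=True).text_to_ids(
    "hello there world")
os.remove(vocab_path)

print(plain)    # [3, 2, 4]
print(wrapped)  # [0, 3, 2, 4, 1]
```

With the real wrapper, the same flag decides whether text-to-id conversion returns the bare ids or the ids that self.tokenizer(text).input_ids would produce.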

class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Dict[str, str] | List[str] | None = None, legacy: bool = False, ignore_extra_whitespaces: bool = True, chat_template: Dict | None = None, trim_spm_separator_after_special_token=True, spm_separator='▁')

SentencePiece tokenizer (google/sentencepiece).

Parameters:

model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()
special_tokens – either list of special tokens or dictionary of token name to token value
legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the ability to add special tokens inside the wrapper.
ignore_extra_whitespaces – whether to ignore extra whitespaces in the input text while encoding. Note: this exists for current model tokenizers that do not handle extra whitespaces, because by default the tokenizer learned to ignore them. To check whether the tokenizer ignores extra whitespaces by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models handle this natively.
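The ignore_extra_whitespaces behavior can be illustrated with a small sketch. It assumes that "ignoring extra whitespaces" amounts to collapsing whitespace runs before encoding; the actual NeMo implementation may differ. toy_encode and ignores_extra_spaces are hypothetical helpers, and the probe only mimics the kind of check that an attribute such as removed_extra_spaces could be based on.

```python
def collapse_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def toy_encode(text, ignore_extra_whitespaces=True):
    """Hypothetical encoder: split on single spaces so extra spaces stay visible."""
    if ignore_extra_whitespaces:
        text = collapse_whitespace(text)
    return text.split(" ")


def ignores_extra_spaces(encode):
    """Probe an encoder: does doubling a space change its output?"""
    return encode("x  y") == encode("x y")


raw = "hello   world "
print(toy_encode(raw, ignore_extra_whitespaces=True))   # ['hello', 'world']
print(toy_encode(raw, ignore_extra_whitespaces=False))  # ['hello', '', '', 'world', '']

print(ignores_extra_spaces(lambda t: toy_encode(t, True)))   # True
print(ignores_extra_spaces(lambda t: toy_encode(t, False)))  # False
```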

__init__(model_path: str, special_tokens: Dict[str, str] | List[str] | None = None, legacy: bool = False, ignore_extra_whitespaces: bool = True, chat_template: Dict | None = None, trim_spm_separator_after_special_token=True, spm_separator='▁')
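The special_tokens argument accepts either a list of tokens or a dictionary of token name to token value. A minimal sketch of how both shapes might be handled follows; register_special_tokens and the base vocabulary are hypothetical, and the id assignment (new ids appended after the base vocabulary, existing pieces left alone) is an assumption for illustration, not the wrapper's confirmed behavior.

```python
def register_special_tokens(special_tokens, base_vocab):
    """Assign ids past the base vocabulary to special tokens not already in it.

    `special_tokens` may be a list of token strings, a dict mapping a token
    name (for example "pad_token") to its string value, or None.
    Returns (name -> token, token -> new id).
    """
    if isinstance(special_tokens, dict):
        named = dict(special_tokens)
        tokens = list(special_tokens.values())
    else:
        named = {}
        tokens = list(special_tokens or [])

    token_to_id = {}
    next_id = len(base_vocab)
    for tok in tokens:
        # Tokens the base vocabulary already knows keep their existing id.
        if tok not in base_vocab and tok not in token_to_id:
            token_to_id[tok] = next_id
            next_id += 1
    return named, token_to_id


base = {"<unk>": 0, "▁hello": 1, "▁world": 2}

from_list = register_special_tokens(["<pad>", "<mask>"], base)
from_dict = register_special_tokens(
    {"pad_token": "<pad>", "unk_token": "<unk>"}, base)

print(from_list)  # ({}, {'<pad>': 3, '<mask>': 4})
print(from_dict)  # ({'pad_token': '<pad>', 'unk_token': '<unk>'}, {'<pad>': 3})
```

The dictionary form carries one extra piece of information, the token's role, which a wrapper can expose as a named attribute; the list form only extends the vocabulary.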

class nemo.collections.common.tokenizers.TokenizerSpec

Inherit from this class to implement a new tokenizer.

__init__()
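A custom tokenizer built on this base class might look like the sketch below. To keep the example self-contained, it subclasses a local stand-in, TokenizerSpecStandIn, instead of importing NeMo. The six conversion methods are an assumption about the abstract interface; check the TokenizerSpec source for the exact set before relying on it. WhitespaceTokenizer is a hypothetical example class.

```python
from abc import ABC, abstractmethod


class TokenizerSpecStandIn(ABC):
    """Local stand-in for nemo.collections.common.tokenizers.TokenizerSpec.

    Assumption: the real base class declares these six conversion methods.
    """

    @abstractmethod
    def text_to_tokens(self, text): ...

    @abstractmethod
    def tokens_to_text(self, tokens): ...

    @abstractmethod
    def tokens_to_ids(self, tokens): ...

    @abstractmethod
    def ids_to_tokens(self, ids): ...

    @abstractmethod
    def text_to_ids(self, text): ...

    @abstractmethod
    def ids_to_text(self, ids): ...


class WhitespaceTokenizer(TokenizerSpecStandIn):
    """Hypothetical tokenizer that splits on whitespace."""

    def __init__(self, vocab, unk_token="<unk>"):
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.unk_id = self.token_to_id[unk_token]

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)

    def tokens_to_ids(self, tokens):
        # Out-of-vocabulary tokens map to the unknown-token id.
        return [self.token_to_id.get(tok, self.unk_id) for tok in tokens]

    def ids_to_tokens(self, ids):
        return [self.id_to_token[i] for i in ids]

    def text_to_ids(self, text):
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids):
        return self.tokens_to_text(self.ids_to_tokens(ids))


tok = WhitespaceTokenizer(["<unk>", "hello", "world"])
ids = tok.text_to_ids("hello brave world")
print(ids)                   # [1, 0, 2]
print(tok.ids_to_text(ids))  # hello <unk> world
```

To use it with NeMo, replace the stand-in with the real TokenizerSpec import and implement whichever abstract methods that class actually declares.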