Tokenizers

class nemo.collections.common.tokenizers.AutoTokenizer(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
additional_special_tokens: List | None = [],
use_fast: bool | None = False,
trust_remote_code: bool | None = False,
)

Wrapper around the HuggingFace AutoTokenizer (https://huggingface.co/transformers/model_doc/auto.html#autotokenizer).

__init__(
pretrained_model_name: str,
vocab_file: str | None = None,
merges_file: str | None = None,
mask_token: str | None = None,
bos_token: str | None = None,
eos_token: str | None = None,
pad_token: str | None = None,
sep_token: str | None = None,
cls_token: str | None = None,
unk_token: str | None = None,
additional_special_tokens: List | None = [],
use_fast: bool | None = False,
trust_remote_code: bool | None = False,
)
Parameters:
  • pretrained_model_name – corresponds to the HuggingFace AutoTokenizer's 'pretrained_model_name_or_path' input argument. For more details, please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP

  • vocab_file – path to a file containing the vocabulary, with entries separated by newlines.

  • merges_file – path to the merges file used by BPE-based tokenizers (e.g., GPT-2).

  • mask_token – mask token

  • bos_token – the beginning-of-sequence token

  • eos_token – the end-of-sequence token; usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token; usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • additional_special_tokens – list of tokens besides the standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

  • use_fast – whether to use the fast HuggingFace tokenizer implementation

  • trust_remote_code – whether to allow loading custom tokenizer code from a remote HuggingFace Hub repository (passed through to AutoTokenizer.from_pretrained)
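
A minimal usage sketch, assuming NeMo is installed and the model name resolves on the HuggingFace Hub ("gpt2" is only an illustrative choice):

    from nemo.collections.common.tokenizers import AutoTokenizer

    # Wrap the HuggingFace "gpt2" tokenizer; the tokenizer files are
    # downloaded from the Hub on first use.
    tokenizer = AutoTokenizer(pretrained_model_name="gpt2", use_fast=True)

    ids = tokenizer.text_to_ids("Hello world!")   # text -> token ids
    text = tokenizer.ids_to_text(ids)             # token ids -> text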

class nemo.collections.common.tokenizers.SentencePieceTokenizer(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Dict | None = None,
)

Wrapper around the SentencePiece tokenizer (google/sentencepiece).

Parameters:
  • model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()

  • special_tokens – either a list of special tokens or a dictionary mapping token names to token values

  • legacy – when set to True, restores the previous behavior of the SentencePiece wrapper, including the ability to add special tokens inside the wrapper.

  • ignore_extra_whitespaces – whether to ignore extra whitespace in the input text while encoding. Note: this exists for current models whose tokenizers learned to ignore extra whitespace by default. To check whether a tokenizer ignores extra whitespace by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models handle this natively.

__init__(
model_path: str,
special_tokens: Dict[str, str] | List[str] | None = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Dict | None = None,
)
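
A minimal usage sketch, assuming a trained SentencePiece model file already exists on disk (the path below is hypothetical):

    from nemo.collections.common.tokenizers import SentencePieceTokenizer

    # Load a trained SentencePiece model (hypothetical path; the file can be
    # produced with create_spt_model() or the sentencepiece trainer itself).
    tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")

    tokens = tokenizer.text_to_tokens("Hello world!")  # text -> subword pieces
    ids = tokenizer.text_to_ids("Hello world!")        # text -> token ids
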
class nemo.collections.common.tokenizers.TokenizerSpec

Inherit from this class to implement a new tokenizer.

__init__()
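
To illustrate the interface a subclass is expected to provide, here is a minimal sketch of a toy whitespace tokenizer. The six conversion methods implemented below (text_to_tokens, tokens_to_text, tokens_to_ids, ids_to_tokens, text_to_ids, ids_to_text) follow the convention used by the tokenizers above; the whitespace splitting and fixed vocabulary are purely illustrative:

    from typing import Dict, List

    from nemo.collections.common.tokenizers import TokenizerSpec

    class WhitespaceTokenizer(TokenizerSpec):
        """Toy tokenizer that splits text on whitespace (illustrative only)."""

        def __init__(self, vocab: List[str]):
            self.vocab = vocab
            self._tok2id: Dict[str, int] = {t: i for i, t in enumerate(vocab)}

        def text_to_tokens(self, text: str) -> List[str]:
            return text.split()

        def tokens_to_text(self, tokens: List[str]) -> str:
            return " ".join(tokens)

        def tokens_to_ids(self, tokens: List[str]) -> List[int]:
            return [self._tok2id[t] for t in tokens]

        def ids_to_tokens(self, ids: List[int]) -> List[str]:
            return [self.vocab[i] for i in ids]

        def text_to_ids(self, text: str) -> List[int]:
            return self.tokens_to_ids(self.text_to_tokens(text))

        def ids_to_text(self, ids: List[int]) -> str:
            return self.tokens_to_text(self.ids_to_tokens(ids))

    # Usage (illustrative): round-trip a short string through the tokenizer.
    tok = WhitespaceTokenizer(vocab=["Hello", "world!"])
    assert tok.ids_to_text(tok.text_to_ids("Hello world!")) == "Hello world!"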