Tokenizers

class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False, trust_remote_code: Optional[bool] = False)

Wrapper around the HuggingFace AutoTokenizer: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.

__init__(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False, trust_remote_code: Optional[bool] = False)
Parameters
  • pretrained_model_name – corresponds to the HuggingFace AutoTokenizer’s ‘pretrained_model_name_or_path’ argument. For more details, refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP

  • vocab_file – path to a file with the vocabulary, which consists of characters separated by newlines.

  • merges_file – path to the merges file used by BPE-based tokenizers

  • mask_token – mask token

  • bos_token – the beginning of sequence token

  • eos_token – the end of sequence token. Usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token. Usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • use_fast – whether to use the fast HuggingFace tokenizer implementation

  • trust_remote_code – whether to allow custom tokenizer code from the HuggingFace Hub to be executed locally
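
For orientation, here is a minimal usage sketch (not part of the API reference itself). It assumes a standard NeMo installation and uses bert-base-uncased purely as an example checkpoint; the text_to_tokens / text_to_ids / ids_to_text helpers come from the TokenizerSpec interface documented below.

# Minimal sketch: wrap a HuggingFace BERT tokenizer and round-trip a sentence.
# "bert-base-uncased" is only an example model name.
from nemo.collections.common.tokenizers import AutoTokenizer

tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased", use_fast=True)

text = "NeMo wraps HuggingFace tokenizers."
tokens = tokenizer.text_to_tokens(text)  # subword tokens, e.g. ['nemo', 'wraps', ...]
ids = tokenizer.text_to_ids(text)        # corresponding vocabulary ids
restored = tokenizer.ids_to_text(ids)    # decode the ids back to a string

print(tokens, ids, restored)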

class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)

SentencePiece tokenizer: https://github.com/google/sentencepiece.

Parameters
  • model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()

  • special_tokens – either a list of special tokens or a dictionary mapping token names to token values

  • legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility to add special tokens inside the wrapper.

__init__(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)
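
The sketch below is a hedged example of creating and loading a model. It trains a tiny model with the sentencepiece package directly rather than via create_spt_model(), and the file names corpus.txt and example_spt are placeholders.

# Train a small SentencePiece model from a plain-text corpus
# (one sentence per line), then wrap it with NeMo's tokenizer class.
import sentencepiece as spm

from nemo.collections.common.tokenizers import SentencePieceTokenizer

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="example_spt", vocab_size=1000
)

tokenizer = SentencePieceTokenizer(model_path="example_spt.model")

text = "SentencePiece works on raw text."
print(tokenizer.text_to_tokens(text))  # subword pieces, e.g. ['▁Sent', 'ence', ...]
print(tokenizer.text_to_ids(text))     # corresponding vocabulary ids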
class nemo.collections.common.tokenizers.TokenizerSpec

Inherit this class to implement a new tokenizer.

__init__()
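
To illustrate the contract, here is a toy whitespace tokenizer built on TokenizerSpec. It is only a sketch: it assumes the six text/token/id conversion methods shown are the ones subclasses must provide, so check the TokenizerSpec source in your NeMo version for the exact abstract interface.

# A toy whitespace tokenizer implementing the assumed TokenizerSpec interface.
from nemo.collections.common.tokenizers import TokenizerSpec

class WhitespaceTokenizer(TokenizerSpec):
    def __init__(self, vocab):
        # vocab: list of known tokens; anything else maps to "<unk>".
        self.vocab = list(vocab) + ["<unk>"]
        self.token_to_id = {t: i for i, t in enumerate(self.vocab)}

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)

    def tokens_to_ids(self, tokens):
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(t, unk) for t in tokens]

    def ids_to_tokens(self, ids):
        return [self.vocab[i] for i in ids]

    def text_to_ids(self, text):
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids):
        return self.tokens_to_text(self.ids_to_tokens(ids))

tokenizer = WhitespaceTokenizer(vocab=["hello", "world"])
print(tokenizer.text_to_ids("hello unknown world"))  # [0, 2, 1]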