Tokenizers

class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False, trust_remote_code: Optional[bool] = False)

Wrapper around the HuggingFace AutoTokenizer (https://huggingface.co/transformers/model_doc/auto.html#autotokenizer).

__init__(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False, trust_remote_code: Optional[bool] = False)
Args:
pretrained_model_name: corresponds to the HuggingFace AutoTokenizer's 'pretrained_model_name_or_path' argument.

For more details please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP

vocab_file: path to a file with the vocabulary, which consists of characters separated by '\n'.

mask_token: mask token
bos_token: the beginning of sequence token
eos_token: the end of sequence token. Usually equal to sep_token
pad_token: token to use for padding
sep_token: token used for separating sequences
cls_token: class token. Usually equal to bos_token
unk_token: token to use for unknown tokens
use_fast: whether to use the fast HuggingFace tokenizer
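For illustration, a minimal usage sketch (not from the original documentation) is shown below. It assumes the "bert-base-uncased" checkpoint can be fetched from the HuggingFace hub and that this NeMo release exposes the TokenizerSpec helpers (text_to_tokens, text_to_ids, ids_to_text) on the wrapper.

# Hedged usage sketch; "bert-base-uncased" and the helper methods are assumptions.
from nemo.collections.common.tokenizers import AutoTokenizer

tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased", use_fast=True)

text = "NeMo wraps the HuggingFace tokenizer."
tokens = tokenizer.text_to_tokens(text)   # list of subword strings
ids = tokenizer.text_to_ids(text)         # list of vocabulary ids
print(tokens)
print(ids)
print(tokenizer.ids_to_text(ids))         # round-trip back to text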

class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)

SentencePiece tokenizer (https://github.com/google/sentencepiece).

Parameters
  • model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()

  • special_tokens – either list of special tokens or dictionary of token name to token value

  • legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility of adding special tokens inside the wrapper.

__init__(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)
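As a hedged usage sketch (the model path below is hypothetical), a SentencePieceTokenizer can be constructed from an existing SentencePiece .model file, for example one produced by create_spt_model(), and used through the same TokenizerSpec-style helpers:

# Hedged usage sketch; the .model path is a placeholder for your own file.
from nemo.collections.common.tokenizers import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")

text = "SentencePiece segments raw text into subword pieces."
tokens = tokenizer.text_to_tokens(text)   # subword pieces
ids = tokenizer.text_to_ids(text)         # corresponding vocabulary ids
print(tokenizer.ids_to_text(ids))         # reconstruct text from ids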

class nemo.collections.common.tokenizers.TokenizerSpec

Inherit this class to implement a new tokenizer.

__init__()
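A toy example of what a subclass might look like is sketched below. It is not part of NeMo, and the exact set of methods TokenizerSpec requires can vary between releases, so check the base class in your installed version. The sketch implements a whitespace tokenizer over a fixed vocabulary:

# Illustrative sketch only; method names assume the usual TokenizerSpec interface.
from nemo.collections.common.tokenizers import TokenizerSpec


class WhitespaceTokenizer(TokenizerSpec):
    def __init__(self, vocab):
        # vocab: iterable of known tokens; anything unknown maps to <unk> (id 0)
        self.id_to_token = ["<unk>"] + list(vocab)
        self.token_to_id = {t: i for i, t in enumerate(self.id_to_token)}

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)

    def tokens_to_ids(self, tokens):
        return [self.token_to_id.get(t, 0) for t in tokens]

    def ids_to_tokens(self, ids):
        return [self.id_to_token[i] for i in ids]

    def text_to_ids(self, text):
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids):
        return self.tokens_to_text(self.ids_to_tokens(ids))

    @property
    def vocab_size(self):
        return len(self.id_to_token)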
