- class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False, trust_remote_code: Optional[bool] = False)
Wrapper of HuggingFace AutoTokenizer https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.
- __init__(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False, trust_remote_code: Optional[bool] = False)
- Args:
pretrained_model_name: corresponds to HuggingFace-AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument. For more details please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP
vocab_file: path to file with vocabulary which consists of characters separated by ‘\n’.
mask_token: mask token
bos_token: the beginning of sequence token
eos_token: the end of sequence token. Usually equal to sep_token
pad_token: token to use for padding
sep_token: token used for separating sequences
cls_token: class token. Usually equal to bos_token
unk_token: token to use for unknown tokens
use_fast: whether to use fast HuggingFace tokenizer
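As a rough illustration of the vocab_file format described above (one entry per line, separated by a newline), the sketch below writes a small vocabulary file and shows how it might be passed to the wrapper. The helper names write_vocab_file and build_tokenizer, the toy vocabulary, and the model name "bert-base-uncased" are assumptions made for this example and do not come from the NeMo documentation. build_tokenizer is defined but not called, since it needs NeMo installed and access to the pretrained model files.

```python
from pathlib import Path
import tempfile


def write_vocab_file(entries, path):
    """Write one vocabulary entry per line, i.e. entries separated by a newline."""
    path = Path(path)
    path.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return path


def build_tokenizer(vocab_path):
    """Hypothetical helper: construct the wrapper from a custom vocabulary file."""
    # Imported lazily so that write_vocab_file works without NeMo installed.
    from nemo.collections.common.tokenizers import AutoTokenizer

    return AutoTokenizer(
        pretrained_model_name="bert-base-uncased",  # assumed model name, illustration only
        vocab_file=str(vocab_path),
        use_fast=True,
    )


# Toy vocabulary (an assumption for this sketch, not a real model vocabulary).
entries = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]
vocab_path = write_vocab_file(entries, Path(tempfile.mkdtemp()) / "vocab.txt")

# Reading the file back yields the same entries, one per line.
lines = vocab_path.read_text(encoding="utf-8").splitlines()
```

With NeMo available, calling build_tokenizer(vocab_path) would be the step that actually creates the tokenizer; whether a toy vocabulary is meaningful for a given pretrained model is a separate question.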
- class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)
SentencePiece tokenizer https://github.com/google/sentencepiece.
- Parameters
model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model()
special_tokens – either list of special tokens or dictionary of token name to token value
legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility to add special tokens inside the wrapper.
- __init__(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)
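The two accepted shapes of special_tokens (a list of tokens, or a dictionary of token name to token value) can be sketched as follows. The helper names, the token strings, and the file name "tokenizer.model" are hypothetical. Setting legacy=True whenever special tokens are supplied is a choice made for this sketch, based on the parameter description above, which ties adding special tokens inside the wrapper to legacy mode.

```python
from typing import Any, Dict, List, Optional, Union

SpecialTokens = Optional[Union[Dict[str, str], List[str]]]


def sentencepiece_kwargs(model_path: str, special_tokens: SpecialTokens = None) -> Dict[str, Any]:
    """Hypothetical helper: assemble constructor arguments for the wrapper."""
    return {
        "model_path": model_path,
        "special_tokens": special_tokens,
        # Assumption of this sketch: enable legacy mode only when special tokens are given.
        "legacy": special_tokens is not None,
    }


def build_sentencepiece_tokenizer(model_path: str, special_tokens: SpecialTokens = None):
    """Construct the wrapper; requires NeMo and an existing SentencePiece model file."""
    # Imported lazily so the kwargs helper can be used without NeMo installed.
    from nemo.collections.common.tokenizers import SentencePieceTokenizer

    return SentencePieceTokenizer(**sentencepiece_kwargs(model_path, special_tokens))


# Special tokens as a list of token strings.
as_list = sentencepiece_kwargs("tokenizer.model", ["<sep>", "<cls>"])

# Special tokens as a mapping from token name to token value.
as_dict = sentencepiece_kwargs("tokenizer.model", {"sep_token": "<sep>", "cls_token": "<cls>"})

# No special tokens: legacy mode stays off.
plain = sentencepiece_kwargs("tokenizer.model")
```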
- class nemo.collections.common.tokenizers.TokenizerSpec
Inherit this class to implement a new tokenizer.
- __init__()
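A minimal sketch of what inheriting from such a base class might look like is given below. To stay runnable without NeMo, it uses a local stand-in base class, TokenizerSpecStandIn, instead of the real TokenizerSpec. The six method names are assumed to mirror the abstract methods of TokenizerSpec and should be checked against the installed NeMo version. WhitespaceTokenizer and its vocabulary are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class TokenizerSpecStandIn(ABC):
    """Local stand-in for TokenizerSpec; the method names are an assumption."""

    @abstractmethod
    def text_to_tokens(self, text: str) -> List[str]: ...

    @abstractmethod
    def tokens_to_text(self, tokens: List[str]) -> str: ...

    @abstractmethod
    def tokens_to_ids(self, tokens: List[str]) -> List[int]: ...

    @abstractmethod
    def ids_to_tokens(self, ids: List[int]) -> List[str]: ...

    @abstractmethod
    def text_to_ids(self, text: str) -> List[int]: ...

    @abstractmethod
    def ids_to_text(self, ids: List[int]) -> str: ...


class WhitespaceTokenizer(TokenizerSpecStandIn):
    """Toy tokenizer that splits on whitespace and maps unseen words to unk_token."""

    def __init__(self, vocab: List[str], unk_token: str = "<unk>"):
        super().__init__()
        # Make sure the unknown token has an id (prepended if missing).
        if unk_token not in vocab:
            vocab = [unk_token] + list(vocab)
        self.unk_token = unk_token
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def text_to_tokens(self, text: str) -> List[str]:
        return text.split()

    def tokens_to_text(self, tokens: List[str]) -> str:
        return " ".join(tokens)

    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(tok, unk_id) for tok in tokens]

    def ids_to_tokens(self, ids: List[int]) -> List[str]:
        return [self.id_to_token[i] for i in ids]

    def text_to_ids(self, text: str) -> List[int]:
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids: List[int]) -> str:
        return self.tokens_to_text(self.ids_to_tokens(ids))


tokenizer = WhitespaceTokenizer(["hello", "world"])
ids = tokenizer.text_to_ids("hello world again")  # "again" is not in the vocabulary
text = tokenizer.ids_to_text(ids)
```

In real use, the subclass would presumably inherit from nemo.collections.common.tokenizers.TokenizerSpec rather than the stand-in.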