Tokenizers#
- class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False)[source]#
Bases:
TokenizerSpec
Wrapper of the HuggingFace AutoTokenizer: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.
- __init__(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False)[source]#
- Args:
- pretrained_model_name: corresponds to the HuggingFace AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument.
For more details, refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP
- vocab_file: path to a file with the vocabulary, which consists of characters separated by ‘\n’.
- mask_token: mask token
- bos_token: the beginning-of-sequence token
- eos_token: the end-of-sequence token; usually equal to sep_token
- pad_token: token to use for padding
- sep_token: token used for separating sequences
- cls_token: class token; usually equal to bos_token
- unk_token: token to use for unknown tokens
- use_fast: whether to use the fast HuggingFace tokenizer
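For orientation, a minimal usage sketch (the model name "bert-base-uncased" is an arbitrary illustrative choice; the round-trip methods come from TokenizerSpec):

```python
from nemo.collections.common.tokenizers import AutoTokenizer

# Load a HuggingFace tokenizer by pretrained model name.
tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased")

# Round trip: text -> tokens -> ids -> text.
tokens = tokenizer.text_to_tokens("hello world")
ids = tokenizer.tokens_to_ids(tokens)
print(tokens, ids)
print(tokenizer.ids_to_text(ids))
```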
- add_special_tokens(special_tokens_dict: dict) → int[source]#
Adds a dictionary of special tokens (eos, pad, cls…). If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).
- Parameters
special_tokens_dict – dict of string. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Tokens are only added if they are not already in the vocabulary.
- Returns
Number of tokens added to the vocabulary.
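A brief sketch of extending the vocabulary (continuing the tokenizer from the sketch above; the sentinel string "<extra_0>" is purely illustrative):

```python
# Add a custom sentinel under the predefined "additional_special_tokens" key.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<extra_0>"]}
)
print(f"added {num_added} token(s)")
print(tokenizer.additional_special_tokens_ids)  # ids of the added sentinels
```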
- property additional_special_tokens_ids#
Returns a list of the additional special tokens (excluding bos, eos, pad, unk). Used to return sentinel tokens, e.g. for T5.
- property bos_id#
- property cls_id#
- property eos_id#
- property mask_id#
- property name#
- property pad_id#
- save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None)[source]#
Saves the tokenizer’s vocabulary and other artifacts to the specified directory.
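A one-line sketch (the directory "./my_tokenizer" is a placeholder):

```python
# Persist the vocabulary and related artifacts for later reloading.
tokenizer.save_vocabulary("./my_tokenizer")
```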
- property sep_id#
- property unk_id#
- property vocab#
- property vocab_size#
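The id properties above expose the vocabulary indices of the corresponding special tokens; a short illustrative check (the printed values depend on the loaded model):

```python
# Inspect special-token ids and the vocabulary size.
print(tokenizer.pad_id, tokenizer.eos_id, tokenizer.bos_id)
print(tokenizer.vocab_size)
```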
- class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)[source]#
Bases:
TokenizerSpec
Wrapper of the Google SentencePiece tokenizer: https://github.com/google/sentencepiece.
- Args:
- model_path: path to the SentencePiece tokenizer model. To create the model, use create_spt_model().
- special_tokens: either a list of special tokens or a dictionary of token name to token value.
- legacy: when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility to add special tokens inside the wrapper.
- __init__(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)[source]#
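A minimal usage sketch (assuming an existing SentencePiece model file; "tokenizer.model" is a placeholder path, e.g. one produced by create_spt_model()):

```python
from nemo.collections.common.tokenizers import SentencePieceTokenizer

# Load a trained SentencePiece model from disk.
spt = SentencePieceTokenizer(model_path="tokenizer.model")

ids = spt.text_to_ids("hello world")
print(ids)
print(spt.ids_to_text(ids))
```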
- property additional_special_tokens_ids#
Returns a list of the additional special tokens (excluding bos, eos, pad, unk). Used to return sentinel tokens, e.g. for T5.
- property bos_id#
- property cls_id#
- property eos_id#
- property mask_id#
- property pad_id#
- property sep_id#
- property unk_id#
- property vocab#