Tokenizers#

class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False)[source]#

Bases: TokenizerSpec

Wrapper around the HuggingFace AutoTokenizer: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.

__init__(pretrained_model_name: str, vocab_file: Optional[str] = None, merges_file: Optional[str] = None, mask_token: Optional[str] = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: Optional[str] = None, sep_token: Optional[str] = None, cls_token: Optional[str] = None, unk_token: Optional[str] = None, use_fast: Optional[bool] = False)[source]#
Args:

pretrained_model_name: corresponds to the ‘pretrained_model_name_or_path’ input argument of the HuggingFace AutoTokenizer. For more details please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained. The list of all supported models can be found here: ALL_PRETRAINED_CONFIG_ARCHIVE_MAP

vocab_file: path to a file with the vocabulary, which consists of characters separated by ‘\n’.

mask_token: mask token

bos_token: the beginning of sequence token

eos_token: the end of sequence token. Usually equal to sep_token

pad_token: token to use for padding

sep_token: token used for separating sequences

cls_token: class token. Usually equal to bos_token

unk_token: token to use for unknown tokens

use_fast: whether to use the fast HuggingFace tokenizer

add_special_tokens(special_tokens_dict: dict) int[source]#

Adds a dictionary of special tokens (eos, pad, cls…). If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).

Parameters

special_tokens_dict – dict of string. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Tokens are only added if they are not already in the vocabulary.

Returns

Number of tokens added to the vocabulary.
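
The add-if-missing behavior described above can be sketched in plain Python. This is a minimal illustration, not the NeMo or HuggingFace implementation: the free function `add_special_tokens` and the dictionary-based `vocab` are hypothetical stand-ins, chosen only to show how new tokens receive indices after the end of the current vocabulary and how the return value counts only tokens that were actually added.

```python
from typing import Dict, List, Union

SpecialTokens = Dict[str, Union[str, List[str]]]


def add_special_tokens(vocab: Dict[str, int], special_tokens_dict: SpecialTokens) -> int:
    """Toy stand-in: append unseen special tokens to `vocab` and return how many were added."""
    added = 0
    for value in special_tokens_dict.values():
        # `additional_special_tokens` holds a list; the other keys hold a single string.
        tokens = value if isinstance(value, list) else [value]
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)  # next free index after the current vocabulary
                added += 1
    return added


# Hypothetical three-entry vocabulary, for illustration only.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
num_added = add_special_tokens(
    vocab,
    {"pad_token": "[PAD]", "unk_token": "[UNK]", "additional_special_tokens": ["<extra_0>"]},
)
print(num_added)  # 2, because "[UNK]" was already in the vocabulary
print(vocab["[PAD]"], vocab["<extra_0>"])  # 3 4
```

Calling the function a second time with the same dictionary returns 0, since every token is already present.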

property additional_special_tokens_ids#

Returns a list of the IDs of the additional special tokens (excluding bos, eos, pad, unk). Used to return sentinel tokens, e.g. for T5.

property bos_id#
property cls_id#
property eod#

Returns the EOS token ID. Identical to the eos_id property. Required for megatron-core.

property eos_id#
ids_to_text(ids)[source]#
ids_to_tokens(ids)[source]#
property mask_id#
property name#
property pad_id#
save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None)[source]#

Saves the tokenizer’s vocabulary and other artifacts to the specified directory.

property sep_id#
text_to_ids(text)[source]#
text_to_tokens(text)[source]#
token_to_id(token)[source]#
tokens_to_ids(tokens)[source]#
tokens_to_text(tokens)[source]#
property unk_id#
property vocab#
property vocab_size#
class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)[source]#

Bases: TokenizerSpec

SentencePiece tokenizer; see google/sentencepiece.

Args:

model_path: path to the SentencePiece tokenizer model. To create the model, use create_spt_model().

special_tokens: either a list of special tokens or a dictionary of token name to token value.

legacy: when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility to add special tokens inside the wrapper.

__init__(model_path: str, special_tokens: Optional[Union[Dict[str, str], List[str]]] = None, legacy: bool = False)[source]#
add_special_tokens(special_tokens)[source]#
property additional_special_tokens_ids#

Returns a list of the IDs of the additional special tokens (excluding bos, eos, pad, unk). Used to return sentinel tokens, e.g. for T5.

property bos_id#
property cls_id#
property eos_id#
ids_to_text(ids)[source]#
ids_to_tokens(ids)[source]#
property mask_id#
property pad_id#
property sep_id#
text_to_ids(text)[source]#
text_to_tokens(text)[source]#
token_to_id(token)[source]#
tokens_to_ids(tokens: Union[str, List[str]]) Union[int, List[int]][source]#
tokens_to_text(tokens)[source]#
property unk_id#
property vocab#
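
The tokens_to_ids signature above is annotated as accepting either a single string or a list of strings. The sketch below shows one way such a signature can behave: one token maps to one ID, and a list maps to a list. `TOY_VOCAB`, the standalone function, and the fallback to the `<unk>` ID are assumptions made for illustration; the actual wrapper resolves IDs through the loaded SentencePiece model and may differ in detail.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; a real SentencePiece model supplies its own pieces.
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "hello": 1, "world": 2}


def tokens_to_ids(tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Map one token to one ID, or a list of tokens to a list of IDs."""
    if isinstance(tokens, str):
        # In this sketch, unknown tokens fall back to the <unk> ID.
        return TOY_VOCAB.get(tokens, TOY_VOCAB["<unk>"])
    return [tokens_to_ids(token) for token in tokens]


single_id = tokens_to_ids("hello")
many_ids = tokens_to_ids(["hello", "world", "missing"])
print(single_id)  # 1
print(many_ids)   # [1, 2, 0]
```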
class nemo.collections.common.tokenizers.TokenizerSpec[source]#

Bases: ABC

Inherit this class to implement a new tokenizer.

add_special_tokens(special_tokens: List[str])[source]#
property bos#

Property alias to match MegatronTokenizer; returns bos_id if available.

property cls#

Property alias to match MegatronTokenizer; returns cls_id if available.

property eod#

Property alias to match MegatronTokenizer; returns eod_id if available.

property eos#

Property alias to match MegatronTokenizer; returns eos_id if available.

abstract ids_to_text(ids)[source]#
abstract ids_to_tokens(ids)[source]#
property mask#

Property alias to match MegatronTokenizer; returns mask_id if available.

property name#
property pad#

Property alias to match MegatronTokenizer; returns pad_id if available.

property sep#

Property alias to match MegatronTokenizer; returns sep_id if available.

abstract text_to_ids(text)[source]#
abstract text_to_tokens(text)[source]#
abstract tokens_to_ids(tokens)[source]#
abstract tokens_to_text(tokens)[source]#
property unique_identifiers#

Property required for use with megatron-core datasets.
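
To illustrate the TokenizerSpec contract, here is a minimal sketch of a subclass that implements the six abstract methods listed above. So that it runs without NeMo installed, it defines a local stand-in base class, `TokenizerSpecSketch`, rather than importing the real one; `WhitespaceTokenizer`, its constructor arguments, and its ID layout are hypothetical. The `eos` alias returns None when `eos_id` is missing, which is this sketch's choice and not necessarily what NeMo does.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class TokenizerSpecSketch(ABC):
    """Local stand-in with the six abstract methods; not the NeMo class itself."""

    @abstractmethod
    def text_to_tokens(self, text: str) -> List[str]: ...

    @abstractmethod
    def tokens_to_text(self, tokens: List[str]) -> str: ...

    @abstractmethod
    def tokens_to_ids(self, tokens: List[str]) -> List[int]: ...

    @abstractmethod
    def ids_to_tokens(self, ids: List[int]) -> List[str]: ...

    @abstractmethod
    def text_to_ids(self, text: str) -> List[int]: ...

    @abstractmethod
    def ids_to_text(self, ids: List[int]) -> str: ...

    @property
    def eos(self) -> Optional[int]:
        # Alias pattern: expose eos_id under a MegatronTokenizer-style name, if defined.
        return getattr(self, "eos_id", None)


class WhitespaceTokenizer(TokenizerSpecSketch):
    """Hypothetical tokenizer that splits on whitespace."""

    def __init__(self, words: List[str], unk_token: str = "<unk>", eos_token: str = "</s>"):
        reserved = [unk_token, eos_token]
        self.index_to_token = reserved + [w for w in words if w not in reserved]
        self.token_to_index = {tok: i for i, tok in enumerate(self.index_to_token)}
        self.unk_id = 0
        self.eos_id = 1

    def text_to_tokens(self, text: str) -> List[str]:
        return text.split()

    def tokens_to_text(self, tokens: List[str]) -> str:
        return " ".join(tokens)

    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [self.token_to_index.get(tok, self.unk_id) for tok in tokens]

    def ids_to_tokens(self, ids: List[int]) -> List[str]:
        return [self.index_to_token[i] for i in ids]

    def text_to_ids(self, text: str) -> List[int]:
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids: List[int]) -> str:
        return self.tokens_to_text(self.ids_to_tokens(ids))


tokenizer = WhitespaceTokenizer(["hello", "world"])
ids = tokenizer.text_to_ids("hello world again")
print(ids)                         # [2, 3, 0]
print(tokenizer.ids_to_text(ids))  # hello world <unk>
print(tokenizer.eos)               # 1
```

Because the base class is abstract, instantiating it directly raises TypeError; only subclasses that define all six methods can be created.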