nemo_export.sentencepiece_tokenizer#

Module Contents#

Classes#

SentencePieceTokenizer

SentencePieceTokenizer (https://github.com/google/sentencepiece).

API#

class nemo_export.sentencepiece_tokenizer.SentencePieceTokenizer(
model_path: Optional[str] = None,
special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
legacy: bool = False,
tokenizer: Optional[sentencepiece.SentencePieceProcessor] = None,
)#

SentencePieceTokenizer (https://github.com/google/sentencepiece).

Parameters:
  • model_path – path to the SentencePiece tokenizer model file.

  • special_tokens – either a list of special tokens or a dictionary mapping token names to token values.

  • legacy – when set to True, restores the previous behavior of the SentencePiece wrapper, including the ability to add special tokens inside the wrapper.

  • tokenizer – an existing sentencepiece.SentencePieceProcessor instance to wrap.

Initialization
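
A minimal construction sketch, assuming a trained SentencePiece model file exists on disk (the path below is a placeholder):

```python
from nemo_export.sentencepiece_tokenizer import SentencePieceTokenizer

# Load a trained SentencePiece model from disk (placeholder path).
tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")

# Alternatively, wrap an already-constructed SentencePieceProcessor.
# import sentencepiece
# sp = sentencepiece.SentencePieceProcessor(model_file="/path/to/tokenizer.model")
# tokenizer = SentencePieceTokenizer(tokenizer=sp)
```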

text_to_tokens(text)#
encode(text)#
tokens_to_text(tokens)#
batch_decode(ids)#
token_to_id(token)#
ids_to_tokens(ids)#
tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]]#
add_special_tokens(special_tokens)#
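
The conversion methods compose into a round trip between text, tokens, and ids. A sketch, using the tokenizer constructed above; the exact tokens and ids depend on the loaded model:

```python
text = "Hello world"

tokens = tokenizer.text_to_tokens(text)   # e.g. ['▁Hello', '▁world'] (model dependent)
ids = tokenizer.tokens_to_ids(tokens)     # tokens -> list of int ids
ids_direct = tokenizer.encode(text)       # text -> ids in one step

print(tokenizer.ids_to_tokens(ids))       # ids back to tokens
print(tokenizer.tokens_to_text(tokens))   # tokens back to text
print(tokenizer.batch_decode([ids]))      # decode id sequences (accepted shapes depend on the implementation)
```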
property pad_id#
property bos_token_id#
property eos_token_id#
property sep_id#
property cls_id#
property mask_id#
property unk_id#
property additional_special_tokens_ids#

Returns a list of the additional special tokens (excluding bos, eos, pad, and unk), e.g. the sentinel tokens used by T5.
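
A sketch of adding special tokens through the wrapper, which per the legacy parameter above requires legacy=True; the sentinel-style token names below are placeholders:

```python
tokenizer = SentencePieceTokenizer(
    model_path="/path/to/tokenizer.model",
    special_tokens=["<extra_id_0>", "<extra_id_1>"],  # e.g. T5-style sentinel tokens
    legacy=True,
)

print(tokenizer.additional_special_tokens_ids)  # ids of the added sentinels
print(tokenizer.token_to_id("<extra_id_0>"))    # id of one added token
```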

property vocab#
convert_ids_to_tokens(ids, skip_special_tokens: bool = False)#
convert_tokens_to_string(tokens: List[str])#
__len__()#
property is_fast#
get_added_vocab()#
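
The remaining members appear to mirror part of the Hugging Face tokenizer interface, which lets code written against that API use this wrapper. A brief sketch, continuing with the tokenizer constructed above:

```python
print(len(tokenizer))               # vocabulary size via __len__
print(tokenizer.is_fast)            # Hugging Face-style compatibility flag
print(tokenizer.get_added_vocab())  # tokens added on top of the base vocabulary, if any

ids = tokenizer.encode("Hello world")
tokens = tokenizer.convert_ids_to_tokens(ids, skip_special_tokens=True)
print(tokenizer.convert_tokens_to_string(tokens))
```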