nemo_export.sentencepiece_tokenizer#

Module Contents#

Classes#

SentencePieceTokenizer

SentencePieceTokenizer (https://github.com/google/sentencepiece).

API#

class nemo_export.sentencepiece_tokenizer.SentencePieceTokenizer(
model_path: Optional[str] = None,
special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
legacy: bool = False,
tokenizer: Optional[sentencepiece.SentencePieceProcessor] = None,
)#

SentencePieceTokenizer (https://github.com/google/sentencepiece).

Parameters:
  • model_path – path to the SentencePiece tokenizer model file.

  • special_tokens – either a list of special tokens or a dictionary mapping token names to token values.

  • legacy – when set to True, restores the previous behavior of the SentencePiece wrapper, including the ability to add special tokens inside the wrapper.

  • tokenizer – an existing sentencepiece.SentencePieceProcessor instance to wrap.

Initialization
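
A minimal construction sketch, assuming a trained SentencePiece model file exists on disk (the path below is a placeholder):

```python
from nemo_export.sentencepiece_tokenizer import SentencePieceTokenizer

# Load a trained SentencePiece model from disk (placeholder path).
tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")

# Alternatively, wrap an already-constructed SentencePieceProcessor.
# import sentencepiece
# sp = sentencepiece.SentencePieceProcessor(model_file="/path/to/tokenizer.model")
# tokenizer = SentencePieceTokenizer(tokenizer=sp)
```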

text_to_tokens(text)#
encode(text)#
tokens_to_text(tokens)#
batch_decode(ids)#
token_to_id(token)#
ids_to_tokens(ids)#
tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]]#
add_special_tokens(special_tokens)#
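
The conversion methods compose into a round trip between text, tokens, and ids. A sketch, using the tokenizer constructed above; the exact tokens and ids depend on the loaded model:

```python
text = "Hello world"

tokens = tokenizer.text_to_tokens(text)   # e.g. ['▁Hello', '▁world'] (model dependent)
ids = tokenizer.tokens_to_ids(tokens)     # tokens -> list of int ids
ids_direct = tokenizer.encode(text)       # text -> ids in one step

print(tokenizer.ids_to_tokens(ids))       # ids back to tokens
print(tokenizer.tokens_to_text(tokens))   # tokens back to text
print(tokenizer.batch_decode([ids]))      # decode id sequences (accepted shapes depend on the implementation)
```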
property pad_id#
property bos_token_id#
property eos_token_id#
property sep_id#
property cls_id#
property mask_id#
property unk_id#
property additional_special_tokens_ids#

Returns a list of the additional special tokens (excluding bos, eos, pad, and unk), e.g. the sentinel tokens used by T5.
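
A sketch of adding special tokens through the wrapper, which per the legacy parameter above requires legacy=True; the sentinel-style token names below are placeholders:

```python
tokenizer = SentencePieceTokenizer(
    model_path="/path/to/tokenizer.model",
    special_tokens=["<extra_id_0>", "<extra_id_1>"],  # e.g. T5-style sentinel tokens
    legacy=True,
)

print(tokenizer.additional_special_tokens_ids)  # ids of the added sentinels
print(tokenizer.token_to_id("<extra_id_0>"))    # id of one added token
```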

property vocab#
convert_ids_to_tokens(ids, skip_special_tokens: bool = False)#
convert_tokens_to_string(tokens: List[str])#
__len__()#
property is_fast#
get_added_vocab()#
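
The remaining members appear to mirror part of the Hugging Face tokenizer interface, which lets code written against that API use this wrapper. A brief sketch, continuing with the tokenizer constructed above:

```python
print(len(tokenizer))               # vocabulary size via __len__
print(tokenizer.is_fast)            # Hugging Face-style compatibility flag
print(tokenizer.get_added_vocab())  # tokens added on top of the base vocabulary, if any

ids = tokenizer.encode("Hello world")
tokens = tokenizer.convert_ids_to_tokens(ids, skip_special_tokens=True)
print(tokenizer.convert_tokens_to_string(tokens))
```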