nemo_export.sentencepiece_tokenizer#
Module Contents#
Classes#
SentencePieceTokenizer – SentencePieceTokenizer (https://github.com/google/sentencepiece).
API#
- class nemo_export.sentencepiece_tokenizer.SentencePieceTokenizer(
- model_path: Optional[str] = None,
- special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
- legacy: bool = False,
- tokenizer: Optional[sentencepiece.SentencePieceProcessor] = None,
- )#
SentencePieceTokenizer (https://github.com/google/sentencepiece).
- Parameters:
model_path – path to the SentencePiece tokenizer model file.
special_tokens – either a list of special tokens or a dictionary mapping token names to token values.
legacy – when set to True, restores the previous behavior of the SentencePiece wrapper, including the ability to add special tokens inside the wrapper (see the sketch after the member listing below).
tokenizer – an existing sentencepiece.SentencePieceProcessor instance to wrap.
Initialization
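A minimal usage sketch (not part of the generated reference); it assumes a SentencePiece model file, here called "tokenizer.model", has already been trained and saved to disk:

```python
from nemo_export.sentencepiece_tokenizer import SentencePieceTokenizer

# "tokenizer.model" is a placeholder path to an already-trained SentencePiece model.
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")

text = "Hello world"
tokens = tokenizer.text_to_tokens(text)   # e.g. ['▁Hello', '▁world']
ids = tokenizer.encode(text)              # token ids for the same text
print(tokenizer.tokens_to_text(tokens))   # round-trip back to a string
print(len(tokenizer))                     # vocabulary size via __len__
```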
- text_to_tokens(text)#
- encode(text)#
- tokens_to_text(tokens)#
- batch_decode(ids)#
- token_to_id(token)#
- ids_to_tokens(ids)#
- tokens_to_ids(
- tokens: Union[str, List[str]],
- )#
- add_special_tokens(special_tokens)#
- property pad_id#
- property bos_token_id#
- property eos_token_id#
- property sep_id#
- property cls_id#
- property mask_id#
- property unk_id#
- property additional_special_tokens_ids#
Returns a list of the additional special token ids (excluding bos, eos, pad, and unk); used e.g. to return the sentinel-token ids for T5.
- property vocab#
- convert_ids_to_tokens(ids, skip_special_tokens: bool = False)#
- convert_tokens_to_string(tokens: List[str])#
- __len__()#
- property is_fast#
- get_added_vocab()#
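As a hedged sketch of the legacy special-token path: the model path and sentinel-token names below are illustrative assumptions, not part of the reference.

```python
from nemo_export.sentencepiece_tokenizer import SentencePieceTokenizer

# Hypothetical T5-style sentinel tokens; legacy=True lets the wrapper itself register them.
sentinels = [f"<extra_id_{i}>" for i in range(4)]
tokenizer = SentencePieceTokenizer(
    model_path="tokenizer.model",  # placeholder path to a trained SentencePiece model
    special_tokens=sentinels,
    legacy=True,
)

print(tokenizer.additional_special_tokens_ids)  # ids of the added sentinel tokens
print(tokenizer.tokens_to_ids("<extra_id_0>"))  # look up a single special token
```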