core.tokenizers.text.libraries.sentencepiece_tokenizer#

Module Contents#

Classes#

SentencePieceTokenizer

SentencePiece tokenizer wrapper (https://github.com/google/sentencepiece).

API#

class core.tokenizers.text.libraries.sentencepiece_tokenizer.SentencePieceTokenizer(
tokenizer_path: str,
special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
legacy: bool = False,
ignore_extra_whitespaces: bool = True,
chat_template: Optional[str] = None,
trim_spm_separator_after_special_token=True,
spm_separator='▁',
)#

Bases: core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract, core.tokenizers.text.libraries.chat_template.MegatronTokenizerChatTemplate

SentencePiece tokenizer wrapper (https://github.com/google/sentencepiece).

Initialization

Parameters:
  • tokenizer_path (str) – path to the SentencePiece tokenizer model file.

  • special_tokens (Optional[Union[Dict[str, str], List[str]]]) – either a list of special tokens or a dictionary mapping token names to token values.

  • legacy (bool) – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the possibility to add special tokens inside the wrapper.

  • ignore_extra_whitespaces (bool) – whether to ignore extra whitespaces in the input text while encoding.

    Note: this is done for current models whose tokenizers do not handle extra whitespaces, since by default the tokenizer learned to ignore them. To check whether a tokenizer ignores extra whitespaces by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models handle this natively.

  • chat_template (Optional[str]) – tokenizer chat template in Jinja format.
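A minimal construction sketch based on the documented signature above; the import path follows this module's documented location, and the model file path and special tokens are hypothetical:

    from core.tokenizers.text.libraries.sentencepiece_tokenizer import SentencePieceTokenizer

    tokenizer = SentencePieceTokenizer(
        tokenizer_path="/path/to/tokenizer.model",        # hypothetical SentencePiece .model file
        special_tokens=["<extra_id_0>", "<extra_id_1>"],  # honored inside the wrapper only when legacy=True
        legacy=True,
        ignore_extra_whitespaces=True,
    )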

text_to_tokens(text: str) List[str]#

Converts text to tokens.

text_to_ids(text, sample_alpha=None) List[int]#

Converts text to token ids.

_text_to_ids(text, sample_alpha=None) List[int]#

Converts text to token ids.

_text_to_ids_extra_space(text, sample_alpha=None) List[int]#

Converts text to token ids.
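A short encoding sketch, assuming the tokenizer instance from the construction example above; the token strings in the comments are illustrative, not guaranteed outputs, and the role of sample_alpha as a SentencePiece subword-regularization parameter is an assumption:

    text = "Hello world"
    tokens = tokenizer.text_to_tokens(text)                   # e.g. ['▁Hello', '▁world'] for a typical model
    ids = tokenizer.text_to_ids(text)                         # deterministic segmentation
    sampled_ids = tokenizer.text_to_ids(text, sample_alpha=0.1)  # sampled segmentation, assuming sample_alpha drives SentencePiece sampling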

tokens_to_text(tokens: List[str]) str#

Converts a list of tokens to text.

ids_to_text(ids: List[int]) str#

Converts a list of ids to text.

token_to_id(token: str) int#

Converts a single token to its id.

ids_to_tokens(ids: List[int]) List[str]#

Converts a list of token ids to their token values.

tokens_to_ids(
tokens: Union[str, List[str]],
tokens_to_skip: List[str] = [],
) List[int]#

Converts a list of tokens to their ids.
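A round-trip sketch for the conversion methods, again assuming the tokenizer instance constructed earlier; the literal token '▁Hello' is hypothetical and must exist in the model's vocabulary:

    ids = tokenizer.text_to_ids("Hello world")
    text = tokenizer.ids_to_text(ids)             # back to a plain string
    tokens = tokenizer.ids_to_tokens(ids)         # ids -> token strings
    ids_again = tokenizer.tokens_to_ids(tokens)   # token strings -> ids
    one_id = tokenizer.token_to_id("▁Hello")      # single-token lookup (hypothetical token)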

add_special_tokens(special_tokens: Union[list, dict]) None#

Adds special tokens to the tokenizer.
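A sketch of adding special tokens; per the legacy parameter above, adding tokens inside the wrapper is only available when legacy=True, and the dictionary keys shown are illustrative names rather than a documented schema:

    legacy_tokenizer = SentencePieceTokenizer("/path/to/tokenizer.model", legacy=True)
    legacy_tokenizer.add_special_tokens(["<extra_id_0>", "<extra_id_1>"])              # list form
    legacy_tokenizer.add_special_tokens({"sep_token": "<sep>", "cls_token": "<cls>"})  # name-to-value form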

property pad_id: int#

Returns id of padding token.

property bos_id: int#

Returns id of beginning of sentence token.

property eos_id: int#

Returns id of end of sentence token.

property sep_id: int#

Returns id of the SEP token.

property cls_id: int#

Returns id of classification token.

property mask_id: int#

Returns id of mask token.

property unk_id: int#

Returns id of the unknown token.

property additional_special_tokens_ids: list#

Returns a list of the additional special token ids (excluding bos, eos, pad, and unk). Used to return sentinel tokens for, e.g., T5.
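A sketch of reading the special-token id properties; actual values depend on the underlying SentencePiece model and on any tokens added in legacy mode:

    pad_id, bos_id, eos_id = tokenizer.pad_id, tokenizer.bos_id, tokenizer.eos_id
    unk_id = tokenizer.unk_id
    sentinel_ids = tokenizer.additional_special_tokens_ids   # e.g. T5-style sentinel tokens, possibly empty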

property vocab: list#

Returns tokenizer’s vocabulary.

property inv_vocab: dict#

Returns tokenizer vocab with reversed keys and values.
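A vocabulary inspection sketch; whether inv_vocab maps token strings to ids or the reverse is assumed here from the "reversed keys and values" description above:

    vocab = tokenizer.vocab          # list of token strings, indexed by id
    inv_vocab = tokenizer.inv_vocab  # assumed mapping from token string to id
    vocab_size = len(vocab)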