core.tokenizers.text.libraries.tiktoken_tokenizer#

Module Contents#

Classes#

TikTokenTokenizer

TikTokenTokenizer, a tokenizer built on https://github.com/openai/tiktoken.

Functions#

reload_mergeable_ranks

Reload the tokenizer JSON file and convert it to Tiktoken format.

Data#

API#

core.tokenizers.text.libraries.tiktoken_tokenizer.PATTERN_TIKTOKEN_V1#

'[^\\r\\n\\p{L}\\p{N}]?+\p{L}+|\p{N}| ?[^\\s\\p{L}\\p{N}]++[\r\n]|\s[\r\n]|\s+(?!\S)|\s+'

core.tokenizers.text.libraries.tiktoken_tokenizer.PATTERN_TIKTOKEN_V2#

'[^\\r\\n\\p{L}\\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\p{L…'

core.tokenizers.text.libraries.tiktoken_tokenizer.DEFAULT_TIKTOKEN_MAX_VOCAB#

None

core.tokenizers.text.libraries.tiktoken_tokenizer.SPECIAL_TOKENS#

[’’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’]

core.tokenizers.text.libraries.tiktoken_tokenizer.SPECIAL_TOKEN_TEMPLATE#

'<SPECIAL_{id}>'
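
A small sketch of how the template can be used to generate placeholder special tokens; the named special tokens and the total count below are hypothetical, not the module's actual values:

```python
SPECIAL_TOKEN_TEMPLATE = "<SPECIAL_{id}>"

named_special_tokens = ["<unk>", "<s>", "</s>"]  # hypothetical named tokens
num_special_tokens = 8                           # hypothetical total count

# Fill the remaining slots with template-generated placeholder tokens.
fillers = [
    SPECIAL_TOKEN_TEMPLATE.format(id=i)
    for i in range(len(named_special_tokens), num_special_tokens)
]
print(fillers)  # ['<SPECIAL_3>', '<SPECIAL_4>', '<SPECIAL_5>', '<SPECIAL_6>', '<SPECIAL_7>']
```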

core.tokenizers.text.libraries.tiktoken_tokenizer.reload_mergeable_ranks(
path: str,
max_vocab: Optional[int] = None,
num_special_tokens: Optional[int] = None,
) → Dict[bytes, int]#

Reload the tokenizer JSON file and convert it to Tiktoken format.

Parameters:
  • path (str) – path to the tokenizer.

  • max_vocab (Optional[int]) – maximum size of vocabulary.

  • num_special_tokens (Optional[int]) – number of added special tokens.

Returns:

reloaded tokenizer vocab.

Return type:

Dict[bytes, int]
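
As a rough sketch, the reloaded ranks can be combined with one of the split patterns above to build a tiktoken Encoding directly. The vocabulary path is a placeholder, and the id layout (special tokens occupying the first num_special_tokens ids) is an assumption about this module rather than documented behaviour:

```python
import tiktoken

from core.tokenizers.text.libraries.tiktoken_tokenizer import (
    PATTERN_TIKTOKEN_V2,
    SPECIAL_TOKEN_TEMPLATE,
    reload_mergeable_ranks,
)

num_special_tokens = 1000

# "tokenizer.json" is a placeholder path to a tiktoken-style vocabulary file.
ranks = reload_mergeable_ranks(
    "tokenizer.json",
    max_vocab=None,
    num_special_tokens=num_special_tokens,
)

# Assumption: the reloaded ranks leave the first `num_special_tokens` ids free,
# so placeholder special tokens can occupy ids 0 .. num_special_tokens - 1.
special_tokens = {
    SPECIAL_TOKEN_TEMPLATE.format(id=i): i for i in range(num_special_tokens)
}

encoding = tiktoken.Encoding(
    name="example",
    pat_str=PATTERN_TIKTOKEN_V2,
    mergeable_ranks=ranks,
    special_tokens=special_tokens,
)
print(encoding.encode("Hello world"))
```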

class core.tokenizers.text.libraries.tiktoken_tokenizer.TikTokenTokenizer(
tokenizer_path: str,
special_tokens: Optional[List[str]] = None,
num_special_tokens: Optional[int] = 1000,
chat_template: Optional[str] = None,
pattern: Optional[str] = 'v2',
vocab_size: Optional[int] = DEFAULT_TIKTOKEN_MAX_VOCAB,
)#

Bases: core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract, core.tokenizers.text.libraries.chat_template.MegatronTokenizerChatTemplate

TikTokenTokenizer, a tokenizer built on https://github.com/openai/tiktoken.

Initialization

Parameters:
  • tokenizer_path (str) – path to tokenizer vocabulary.

  • special_tokens (Optional[List[str]]) – template for user-defined special tokens.

  • num_special_tokens (Optional[int]) – number of special tokens to generate.

  • chat_template (Optional[str]) – tokenizer chat template in jinja format.

  • pattern (Optional[str]) – regex pattern to split the text.

  • vocab_size (Optional[int]) – size of vocabulary.
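
A minimal construction sketch; the vocabulary path is a placeholder and the keyword values simply restate the documented defaults:

```python
from core.tokenizers.text.libraries.tiktoken_tokenizer import TikTokenTokenizer

tokenizer = TikTokenTokenizer(
    tokenizer_path="tokenizer.json",  # placeholder path to a tiktoken-style vocabulary
    special_tokens=None,              # fall back to the module's default special tokens
    num_special_tokens=1000,          # documented default
    chat_template=None,
    pattern="v2",                     # presumably selects PATTERN_TIKTOKEN_V2
)
print(tokenizer.vocab_size)
```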

text_to_tokens(text: str) → List[str]#

Converts text to tokens.

tokens_to_text(tokens: List[int]) → str#

Converts list of tokens to text.

token_to_id(token: str) → int#

Converts a single token to its id.

tokens_to_ids(tokens: List[str]) → List[int]#

Converts a list of tokens to a list of their ids.

id_to_token(token_id: int) → str#

Converts token id to token.

ids_to_tokens(token_ids: List[int]) → List[str]#

Converts a list of token ids to a list of tokens.

text_to_ids(text: str) → List[int]#

Converts text to list of ids.

ids_to_text(
tokens: List[int],
remove_special_tokens: bool = False,
) → str#

Converts list of ids to text.
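
A short round-trip sketch with the conversion methods above, using a hypothetical tokenizer built from a placeholder vocabulary path:

```python
from core.tokenizers.text.libraries.tiktoken_tokenizer import TikTokenTokenizer

tokenizer = TikTokenTokenizer("tokenizer.json")  # placeholder vocabulary path

text = "Hello, world!"

tokens = tokenizer.text_to_tokens(text)   # List[str]
ids = tokenizer.tokens_to_ids(tokens)     # List[int]

# text_to_ids is expected to match tokenizing and then mapping to ids.
assert ids == tokenizer.text_to_ids(text)

print(tokenizer.id_to_token(ids[0]))      # single id back to a token string
print(tokenizer.ids_to_text(ids, remove_special_tokens=False))
```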

abstractmethod add_special_tokens(special_tokens_dict: dict)#

Adds special tokens to the tokenizer.

property additional_special_tokens_ids: list#

Returns a list of the additional special tokens, excluding [bos, eos, pad, unk] and special_filler. Used to return sentinel tokens, e.g. for T5.

property bos_id: int#

Returns id of beginning of sentence token.

property eos_id: int#

Returns id of end of sentence token.

property eod: int#

Returns id of end of document token.

property unk_id: int#

Returns id of the unknown token.

property mask_id: int#

Returns id of mask token.

property pad_id: int#

Returns id of padding token.

property cls_id: int#

Returns id of classification token.

property sep_id: int#

Returns id of SEP token.
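
For example, the id properties can be used to frame and right-pad an encoded sequence; the tokenizer construction and the target length below are placeholders:

```python
from core.tokenizers.text.libraries.tiktoken_tokenizer import TikTokenTokenizer

tokenizer = TikTokenTokenizer("tokenizer.json")  # placeholder vocabulary path
max_len = 16                                     # hypothetical target length

ids = tokenizer.text_to_ids("Hello, world!")
ids = [tokenizer.bos_id] + ids + [tokenizer.eos_id]

# Right-pad with the padding token id up to the fixed length.
ids = ids + [tokenizer.pad_id] * (max_len - len(ids))
print(ids)
```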

property vocab#

Returns tokenizer vocab.

property decoder#

property encoder#
property vocab_size: int#

Returns tokenizer vocab size.

property inv_vocab: dict#

Returns tokenizer vocab with reversed keys and values.
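
A small sketch of the vocabulary accessors, assuming vocab maps token strings to ids and inv_vocab is its inverse (as described above); the tokenizer construction is again a placeholder:

```python
from core.tokenizers.text.libraries.tiktoken_tokenizer import TikTokenTokenizer

tokenizer = TikTokenTokenizer("tokenizer.json")  # placeholder vocabulary path

print(tokenizer.vocab_size)               # number of entries in the vocabulary

some_id = tokenizer.text_to_ids("hello")[0]
token = tokenizer.inv_vocab[some_id]      # id -> token lookup
assert tokenizer.vocab[token] == some_id  # token -> id lookup round-trips
print(token)
```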