core.tokenizers.text.libraries.tiktoken_tokenizer#
Module Contents#
Classes#
TikTokenTokenizer – Tokenizer based on https://github.com/openai/tiktoken.
Functions#
reload_mergeable_ranks – Reload the tokenizer JSON file and convert it to Tiktoken format.
Data#
API#
- core.tokenizers.text.libraries.tiktoken_tokenizer.PATTERN_TIKTOKEN_V1#
'[^\\r\\n\\p{L}\\p{N}]?+\p{L}+|\p{N}| ?[^\\s\\p{L}\\p{N}]++[\r\n]|\s[\r\n]|\s+(?!\S)|\s+'
- core.tokenizers.text.libraries.tiktoken_tokenizer.PATTERN_TIKTOKEN_V2#
'[^\\r\\n\\p{L}\\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\p{L…'
- core.tokenizers.text.libraries.tiktoken_tokenizer.DEFAULT_TIKTOKEN_MAX_VOCAB#
None
- core.tokenizers.text.libraries.tiktoken_tokenizer.SPECIAL_TOKENS#
['<unk>', '<s>', '</s>', '<mask>', '<pad>', '<cls>', '<sep>']
- core.tokenizers.text.libraries.tiktoken_tokenizer.SPECIAL_TOKEN_TEMPLATE#
'<SPECIAL_{id}>'
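A small sketch of how the template can expand into filler special-token names; the range of ids shown here is illustrative only:

```python
# Illustrative only: expand the template into filler special-token names.
SPECIAL_TOKEN_TEMPLATE = "<SPECIAL_{id}>"

filler_tokens = [SPECIAL_TOKEN_TEMPLATE.format(id=i) for i in range(3)]
print(filler_tokens)  # ['<SPECIAL_0>', '<SPECIAL_1>', '<SPECIAL_2>']
```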
- core.tokenizers.text.libraries.tiktoken_tokenizer.reload_mergeable_ranks(
- path: str,
- max_vocab: Optional[int] = None,
- num_special_tokens: Optional[int] = None,
- ) Dict[bytes, int]#
Reload the tokenizer JSON file and convert it to Tiktoken format.
- Parameters:
path (str) – path to the tokenizer.
max_vocab (Optional[int]) – maximum size of vocabulary.
num_special_tokens (Optional[int]) – number of added special tokens.
- Returns:
reloaded tokenizer vocab.
- Return type:
Dict[bytes, int]
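A hedged usage sketch: the reloaded ranks can be passed to tiktoken.Encoding together with one of the split patterns above. The vocabulary path is a placeholder, and special tokens are omitted for brevity.

```python
import tiktoken

from core.tokenizers.text.libraries.tiktoken_tokenizer import (
    PATTERN_TIKTOKEN_V1,
    reload_mergeable_ranks,
)

# Placeholder path to a tokenizer vocabulary JSON file.
ranks = reload_mergeable_ranks("tokenizer.json", num_special_tokens=1000)

# Build a plain tiktoken Encoding from the reloaded ranks (no special tokens here).
enc = tiktoken.Encoding(
    name="custom_bpe",
    pat_str=PATTERN_TIKTOKEN_V1,
    mergeable_ranks=ranks,
    special_tokens={},
)
print(enc.encode("Hello world"))
```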
- class core.tokenizers.text.libraries.tiktoken_tokenizer.TikTokenTokenizer(
- tokenizer_path: str,
- special_tokens: Optional[List[str]] = None,
- num_special_tokens: Optional[int] = 1000,
- chat_template: Optional[str] = None,
- pattern: Optional[str] = 'v2',
- vocab_size: Optional[int] = DEFAULT_TIKTOKEN_MAX_VOCAB,
- )#
Bases:
core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract, core.tokenizers.text.libraries.chat_template.MegatronTokenizerChatTemplate

Tokenizer based on https://github.com/openai/tiktoken.
Initialization
- Parameters:
tokenizer_path (str) – path to tokenizer vocabulary.
special_tokens (Optional[List[str]]) – template for user-defined special tokens.
num_special_tokens (Optional[int]) – number of special tokens to generate.
chat_template (Optional[str]) – tokenizer chat template in jinja format.
pattern (Optional[str]) – regex pattern to split the text.
vocab_size (Optional[int]) – size of vocabulary.
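A minimal construction sketch, assuming a local vocabulary file; the path is a placeholder and the remaining arguments mirror the defaults above.

```python
from core.tokenizers.text.libraries.tiktoken_tokenizer import TikTokenTokenizer

# Placeholder vocabulary path; the "v2" split pattern and 1000 reserved special
# tokens are the documented defaults.
tokenizer = TikTokenTokenizer(
    tokenizer_path="tokenizer.json",
    pattern="v2",
    num_special_tokens=1000,
)
print(tokenizer.vocab_size)
```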
- text_to_tokens(text: str) List[str]#
Converts text to tokens.
- tokens_to_text(tokens: List[int]) str#
Converts list of tokens to text.
- token_to_id(token: str) int#
Converts a single token to its id.
- tokens_to_ids(tokens: List[str]) List[int]#
Converts a list of tokens to a list of their ids.
- id_to_token(token_id: int) str#
Converts token id to token.
- ids_to_tokens(token_ids: List[int]) List[str]#
Converts list of tokens ids to list of tokens.
- text_to_ids(text: str) List[int]#
Converts text to list of ids.
- ids_to_text(
- tokens: List[int],
- remove_special_tokens: bool = False,
- ) str#
Converts list of ids to text.
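Continuing the construction sketch above, a round-trip through the conversion methods (actual tokens and ids depend on the loaded vocabulary):

```python
text = "TikToken splits text with a regex, then applies BPE merges."

tokens = tokenizer.text_to_tokens(text)   # List[str]
ids = tokenizer.tokens_to_ids(tokens)     # List[int]
assert ids == tokenizer.text_to_ids(text)

restored = tokenizer.ids_to_text(ids, remove_special_tokens=True)
print(restored)
```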
- abstractmethod add_special_tokens(special_tokens_dict: dict)#
Adds special tokens to the tokenizer.
- property additional_special_tokens_ids: list#
Returns a list of the additional special tokens, excluding [bos, eos, pad, unk] and special_filler. Used to return sentinel tokens for e.g. T5.
- property bos_id: int#
Returns id of beginning of sentence token.
- property eos_id: int#
Returns id of end of sentence token.
- property eod: int#
Returns id of end of document token.
- property unk_id: int#
Returns id of unknown tokens.
- property mask_id: int#
Returns id of mask token.
- property pad_id: int#
Returns id of padding token.
- property cls_id: int#
Returns id of classification token.
- property sep_id: int#
Returns id of SEP token.
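A short sketch of how the id properties can be used, e.g. to terminate and pad an encoded sequence (the padding scheme here is an assumption, not part of the tokenizer):

```python
# Append EOS after encoding and right-pad to a fixed length (illustrative only).
ids = tokenizer.text_to_ids("Hello world") + [tokenizer.eos_id]
max_len = 16
padded = ids + [tokenizer.pad_id] * (max_len - len(ids))
```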
- property vocab#
Returns tokenizer vocab.
- property decoder#
- property encoder#
- property vocab_size: int#
Returns tokenizer vocab size.
- property inv_vocab: dict#
Returns tokenizer vocab with reversed keys and values.
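Assuming vocab maps tokens to ids and inv_vocab maps ids back to tokens, the two mappings can be cross-checked like this:

```python
# Assumption: vocab is token -> id, inv_vocab is id -> token.
token = tokenizer.id_to_token(0)
assert tokenizer.inv_vocab[tokenizer.token_to_id(token)] == token
print(len(tokenizer.inv_vocab) == tokenizer.vocab_size)
```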