nemo_export.tiktoken_tokenizer#

Module Contents#

Classes#

TiktokenTokenizer

Functions#

reload_mergeable_ranks

Reload the tokenizer JSON file and convert it to Tiktoken format.

Data#

PATTERN_TIKTOKEN

DEFAULT_TIKTOKEN_MAX_VOCAB

SPECIAL_TOKENS

SPECIAL_TOKEN_TEMPLATE

API#

nemo_export.tiktoken_tokenizer.PATTERN_TIKTOKEN = '[^\\\\r\\\\n\\\\p{L}\\\\p{N}]?[\\\\p{Lu}\\\\p{Lt}\\\\p{Lm}\\\\p{Lo}\\\\p{M}]*[\\\\p{Ll}\\\\p{Lm}\\\\p{Lo}\\\\p{M}]+|[^\\\\r\\\\n\\\\...'#
nemo_export.tiktoken_tokenizer.DEFAULT_TIKTOKEN_MAX_VOCAB = None#
nemo_export.tiktoken_tokenizer.SPECIAL_TOKENS = ['<unk>', '<s>', '</s>']#
nemo_export.tiktoken_tokenizer.SPECIAL_TOKEN_TEMPLATE = '<SPECIAL_{id}>'#
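
SPECIAL_TOKEN_TEMPLATE is an ordinary Python format string with an `id` field, so it can be expanded with `str.format`. The sketch below copies the two values shown above and pads the named special tokens with template-generated placeholders. The helper name `build_special_tokens` and its `num_special_tokens` parameter are hypothetical; this page does not say how the module fills its reserved slots, so treat this as an illustration of the template only.

```python
from typing import List

# Values copied from this page.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]
SPECIAL_TOKEN_TEMPLATE = "<SPECIAL_{id}>"


def build_special_tokens(num_special_tokens: int) -> List[str]:
    """Pad the named special tokens with template-generated placeholders.

    Hypothetical helper: `num_special_tokens` is an assumed parameter,
    not something documented on this page.
    """
    if num_special_tokens < len(SPECIAL_TOKENS):
        raise ValueError("num_special_tokens is smaller than the named token list")
    tokens = list(SPECIAL_TOKENS)
    # Fill the remaining slots, numbering each placeholder by its index.
    for i in range(len(tokens), num_special_tokens):
        tokens.append(SPECIAL_TOKEN_TEMPLATE.format(id=i))
    return tokens


print(build_special_tokens(5))
# ['<unk>', '<s>', '</s>', '<SPECIAL_3>', '<SPECIAL_4>']
```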
nemo_export.tiktoken_tokenizer.reload_mergeable_ranks(
path: str,
max_vocab: Optional[int] = None,
) → Dict[bytes, int]#

Reload the tokenizer JSON file and convert it to Tiktoken format.
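
The return type, a mapping from token bytes to integer rank, is the shape that tiktoken expects for its `mergeable_ranks` argument. This page does not describe the JSON schema, so the following is only a sketch of what such a conversion might look like. It assumes the file holds a list of entries with a `rank` integer and a base64-encoded `token_bytes` string; those field names and the function name `load_ranks_sketch` are assumptions, not the documented behavior of `reload_mergeable_ranks`.

```python
import base64
import json
import os
import tempfile
from typing import Dict, Optional


def load_ranks_sketch(path: str, max_vocab: Optional[int] = None) -> Dict[bytes, int]:
    """Illustrative only: read a JSON vocab list and map token bytes to rank.

    Assumes each entry looks like {"rank": int, "token_bytes": "<base64>"}.
    """
    with open(path, "r", encoding="utf-8") as f:
        vocab = json.load(f)
    if max_vocab is not None:
        vocab = vocab[:max_vocab]  # keep only the first `max_vocab` entries
    return {base64.b64decode(entry["token_bytes"]): entry["rank"] for entry in vocab}


# Build a tiny four-entry vocab file (single-byte tokens) to exercise the sketch.
entries = [
    {"rank": i, "token_bytes": base64.b64encode(bytes([i])).decode("ascii")}
    for i in range(4)
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vocab.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(entries, f)
    ranks = load_ranks_sketch(demo_path, max_vocab=3)

print(ranks)
# {b'\x00': 0, b'\x01': 1, b'\x02': 2}
```

With `max_vocab=None`, the default given by DEFAULT_TIKTOKEN_MAX_VOCAB, the sketch keeps every entry in the file.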

class nemo_export.tiktoken_tokenizer.TiktokenTokenizer(vocab_file: str)#

Initialization

encode(text)#
decode(tokens)#
batch_decode(ids)#
property pad_id#
property bos_token_id#
property eos_token_id#
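
The members listed above can be exercised with a short round trip. The sketch below is written against any object that exposes the same member names, so it runs without NeMo installed. `_StubTokenizer` is a hypothetical stand-in, not the real class, and the assumption that `encode` returns a list of integer ids and `decode` returns a string is not stated on this page.

```python
from typing import Any, Dict


class _StubTokenizer:
    """Hypothetical stand-in with the member names listed above.

    It encodes text as raw UTF-8 byte values; the real TiktokenTokenizer
    will produce different ids.
    """

    pad_id = 0
    bos_token_id = 1
    eos_token_id = 2

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def summarize_tokenizer(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Encode `text`, decode it back, and collect the special-token ids."""
    ids = tokenizer.encode(text)
    return {
        "ids": ids,
        "round_trip": tokenizer.decode(ids),
        "pad_id": tokenizer.pad_id,
        "bos_token_id": tokenizer.bos_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }


summary = summarize_tokenizer(_StubTokenizer(), "hi")
print(summary["ids"], summary["round_trip"])
# [104, 105] hi
```

With NeMo installed, an instance created as `TiktokenTokenizer(vocab_file=...)` could presumably be passed to `summarize_tokenizer` in place of the stub, given a path to a real vocabulary JSON file.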