core.tokenizers.text.libraries.bytelevel_tokenizer#

Module Contents#

Classes#

ByteLevelTokenizer

A byte-level tokenizer that encodes text as UTF-8 bytes, giving the user control over the EOS, BOS, and PAD token IDs, the vocabulary size, and a mapping of additional special tokens to their IDs.

API#

class core.tokenizers.text.libraries.bytelevel_tokenizer.ByteLevelTokenizer(
special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
vocab_size: int = 512,
_eos_id: int = 0,
_pad_id: int = 1,
_bos_id: int = None,
)#

Bases: core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract

A byte-level tokenizer that encodes text as UTF-8 bytes, giving the user control over the EOS, BOS, and PAD token IDs, the vocabulary size, and a mapping of additional special tokens to their IDs.

Initialization

A byte-level tokenizer that encodes text as UTF-8 bytes.

This tokenizer treats each byte as a token, with a default vocabulary size of 512 to accommodate UTF-8 byte values (0-255) plus special tokens. It can handle arbitrary text input by encoding it into bytes.

Parameters:
  • special_tokens – Dictionary or list of special tokens to add to the vocabulary. These tokens will be assigned IDs at the end of the vocabulary. Defaults to None.

  • vocab_size – Size of the vocabulary; it should be at least 256 so every byte value can be represented. Special tokens are assigned IDs beyond this size. Defaults to 512.

  • _eos_id – ID to use for the end-of-sequence token. Defaults to 0.

  • _pad_id – ID to use for the padding token. Defaults to 1.

  • _bos_id – ID to use for the beginning-of-sequence token. Defaults to None.
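
A minimal construction sketch, grounded in the signature and parameter descriptions above; the <mask> token is purely illustrative and not part of the class's defaults.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

# Defaults: a 512-slot vocabulary, EOS at ID 0, PAD at ID 1, and no BOS token.
tokenizer = ByteLevelTokenizer()

# Extra special tokens (list or dict form) are assigned IDs at the end of the
# vocabulary, i.e. after the `vocab_size` byte slots.
tokenizer_with_mask = ByteLevelTokenizer(
    special_tokens=["<mask>"],  # hypothetical token, for illustration only
    vocab_size=512,
)
```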

text_to_tokens(text)#

Convert text into a list of tokens.

tokens_to_text(tokens)#

Convert a list of tokens back into text.
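
A minimal round-trip sketch for these two methods; the concrete token representation (byte values vs. single-byte strings) is an assumption, so the example only checks that decoding reverses encoding.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

text = "héllo"                            # "é" spans two UTF-8 bytes, hence two tokens
tokens = tokenizer.text_to_tokens(text)   # one token per UTF-8 byte
assert tokenizer.tokens_to_text(tokens) == text
```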

text_to_ids(text)#

Convert text into a list of IDs.

ids_to_text(ids)#

Convert a list of IDs back into text.
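
A sketch of the ID-level round trip, assuming the IDs produced for plain text stay within the byte portion of the vocabulary.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

ids = tokenizer.text_to_ids("abc")            # list of integer IDs, one per UTF-8 byte
assert all(isinstance(i, int) for i in ids)
assert tokenizer.ids_to_text(ids) == "abc"
```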

tokens_to_ids(tokens)#

Convert a list of tokens to a list of IDs.

ids_to_tokens(ids)#

Convert a list of IDs to a list of tokens.
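
A sketch chaining the token- and ID-level views; it assumes the tokens produced by text_to_tokens are valid inputs to tokens_to_ids and survive the round trip unchanged.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

tokens = tokenizer.text_to_tokens("hi")
ids = tokenizer.tokens_to_ids(tokens)          # token sequence -> ID sequence
assert tokenizer.ids_to_tokens(ids) == tokens  # and back again
```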

token_to_id(token)#

Convert a token to its corresponding ID.

id_to_token(id)#

Convert an ID to its corresponding token.
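
A single-token sketch; the token passed in is taken from text_to_tokens, so its concrete type matches whatever representation the tokenizer uses internally.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

token = tokenizer.text_to_tokens("A")[0]        # the single byte token for "A"
token_id = tokenizer.token_to_id(token)
assert tokenizer.id_to_token(token_id) == token
```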

abstractmethod add_special_tokens(special_tokens: Union[list, dict]) → None#

Add special tokens to the tokenizer.

property pad_id#

Get the padding ID.

property bos_id#

Get the beginning-of-sequence ID.

property eos_id#

Get the end-of-sequence ID.

property eod#

Get the end-of-document ID.

property unk_id#

Get the unknown ID.
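
A short sketch reading the special-token properties; the commented defaults restate the constructor defaults above, while the eod and unk_id values are not specified in this page and are printed only for inspection.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

print(tokenizer.eos_id)   # 0 by default (see _eos_id above)
print(tokenizer.pad_id)   # 1 by default (see _pad_id above)
print(tokenizer.bos_id)   # None unless _bos_id is set
print(tokenizer.eod)      # end-of-document ID
print(tokenizer.unk_id)   # unknown-token ID
```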