core.tokenizers.text.libraries.bytelevel_tokenizer#

Module Contents#

Classes#

ByteLevelTokenizer

A byte-level tokenizer that encodes text as UTF-8 bytes, giving the user control over the EOS, BOS, and PAD token IDs, the vocabulary size, and a mapping of additional special tokens to their IDs.

API#

class core.tokenizers.text.libraries.bytelevel_tokenizer.ByteLevelTokenizer(
special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
vocab_size: int = 512,
_eos_id: int = 0,
_pad_id: int = 1,
_bos_id: int = None,
)#

Bases: core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract

A byte-level tokenizer that encodes text as UTF-8 bytes, giving the user control over the EOS, BOS, and PAD token IDs, the vocabulary size, and a mapping of additional special tokens to their IDs.

Initialization

A byte-level tokenizer that encodes text as UTF-8 bytes.

This tokenizer treats each byte as a token, with a default vocabulary size of 512 to accommodate UTF-8 byte values (0-255) plus special tokens. It can handle arbitrary text input by encoding it into bytes.

Parameters:
  • special_tokens – Dictionary or list of special tokens to add to the vocabulary. These tokens will be assigned IDs at the end of the vocabulary. Defaults to None.

  • vocab_size – Size of the vocabulary; it should be at least 256 so every byte value can be represented. Special tokens are assigned IDs beyond this size. Defaults to 512.

  • _eos_id – ID to use for the end-of-sequence token. Defaults to 0.

  • _pad_id – ID to use for the padding token. Defaults to 1.

  • _bos_id – ID to use for the beginning-of-sequence token. Defaults to None.
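
A minimal construction sketch, grounded in the signature and parameter descriptions above; the <mask> token is purely illustrative and not part of the class's defaults.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

# Defaults: a 512-slot vocabulary, EOS at ID 0, PAD at ID 1, and no BOS token.
tokenizer = ByteLevelTokenizer()

# Extra special tokens (list or dict form) are assigned IDs at the end of the
# vocabulary, i.e. after the `vocab_size` byte slots.
tokenizer_with_mask = ByteLevelTokenizer(
    special_tokens=["<mask>"],  # hypothetical token, for illustration only
    vocab_size=512,
)
```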

text_to_tokens(text)#

Convert text into a list of tokens.

tokens_to_text(tokens)#

Convert a list of tokens back into text.
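
A minimal round-trip sketch for these two methods; the concrete token representation (byte values vs. single-byte strings) is an assumption, so the example only checks that decoding reverses encoding.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

text = "héllo"                            # "é" spans two UTF-8 bytes, hence two tokens
tokens = tokenizer.text_to_tokens(text)   # one token per UTF-8 byte
assert tokenizer.tokens_to_text(tokens) == text
```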

text_to_ids(text)#

Convert text into a list of IDs.

ids_to_text(ids)#

Convert a list of IDs back into text.
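
A sketch of the ID-level round trip, assuming the IDs produced for plain text stay within the byte portion of the vocabulary.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

ids = tokenizer.text_to_ids("abc")            # list of integer IDs, one per UTF-8 byte
assert all(isinstance(i, int) for i in ids)
assert tokenizer.ids_to_text(ids) == "abc"
```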

tokens_to_ids(tokens)#

Convert a list of tokens to a list of IDs.

ids_to_tokens(ids)#

Convert a list of IDs to a list of tokens.
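
A sketch chaining the token- and ID-level views; it assumes the tokens produced by text_to_tokens are valid inputs to tokens_to_ids and survive the round trip unchanged.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

tokens = tokenizer.text_to_tokens("hi")
ids = tokenizer.tokens_to_ids(tokens)          # token sequence -> ID sequence
assert tokenizer.ids_to_tokens(ids) == tokens  # and back again
```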

token_to_id(token)#

Convert a token to its corresponding ID.

id_to_token(id)#

Convert an ID to its corresponding token.
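
A single-token sketch; the token passed in is taken from text_to_tokens, so its concrete type matches whatever representation the tokenizer uses internally.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

token = tokenizer.text_to_tokens("A")[0]        # the single byte token for "A"
token_id = tokenizer.token_to_id(token)
assert tokenizer.id_to_token(token_id) == token
```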

abstractmethod add_special_tokens(special_tokens: Union[list, dict]) → None#

Add special tokens to the tokenizer.

property pad_id#

Get the padding ID.

property bos_id#

Get the beginning-of-sequence ID.

property eos_id#

Get the end-of-sequence ID.

property eod#

Get the end-of-document ID.

property unk_id#

Get the unknown ID.
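
A short sketch reading the special-token properties; the commented defaults restate the constructor defaults above, while the eod and unk_id values are not specified in this page and are printed only for inspection.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

tokenizer = ByteLevelTokenizer()

print(tokenizer.eos_id)   # 0 by default (see _eos_id above)
print(tokenizer.pad_id)   # 1 by default (see _pad_id above)
print(tokenizer.bos_id)   # None unless _bos_id is set
print(tokenizer.eod)      # end-of-document ID
print(tokenizer.unk_id)   # unknown-token ID
```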