core.tokenizers.text.libraries.bytelevel_tokenizer#
Module Contents#
Classes#
- ByteLevelTokenizer: A byte-level tokenizer that encodes text as UTF-8 bytes with user control over the EOS, BOS, and PAD tokens, as well as the vocabulary size and a mapping of other special tokens to their IDs.
API#
- class core.tokenizers.text.libraries.bytelevel_tokenizer.ByteLevelTokenizer(
- special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
- vocab_size: int = 512,
- _eos_id: int = 0,
- _pad_id: int = 1,
- _bos_id: Optional[int] = None,
- )#
Bases: core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract

A byte-level tokenizer that encodes text as UTF-8 bytes with user control over the EOS, BOS, and PAD tokens, as well as the vocabulary size and a mapping of other special tokens to their IDs.
Initialization
A byte-level tokenizer that encodes text as UTF-8 bytes.
This tokenizer treats each byte as a token, with a default vocabulary size of 512 to accommodate UTF-8 byte values (0-255) plus special tokens. It can handle arbitrary text input by encoding it into bytes.
- Parameters:
special_tokens – Dictionary or list of special tokens to add to the vocabulary. These tokens will be assigned IDs at the end of the vocabulary. Defaults to None.
vocab_size – Size of the vocabulary; it should be at least 256 to handle all byte values. Special tokens will be added after this size. Defaults to 512.
_eos_id – ID to use for the end-of-sequence token. Defaults to 0.
_pad_id – ID to use for the padding token. Defaults to 1.
_bos_id – ID to use for the beginning-of-sequence token. Defaults to None.
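A minimal construction sketch, not taken from the source: the import path follows this page's module name, and the extra special tokens are hypothetical.

```python
from core.tokenizers.text.libraries.bytelevel_tokenizer import ByteLevelTokenizer

# Defaults: a 512-slot vocabulary, EOS ID 0, PAD ID 1, and no BOS token.
tokenizer = ByteLevelTokenizer()

# special_tokens accepts a list or a dict; the tokens are assigned IDs at the
# end of the vocabulary. "<mask>" and "<sep>" are made-up examples.
masked_tokenizer = ByteLevelTokenizer(
    special_tokens=["<mask>", "<sep>"],
    vocab_size=512,
)
```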
- text_to_tokens(text)#
Convert text to a list of tokens.
- tokens_to_text(tokens)#
Convert a list of tokens back to text.
- text_to_ids(text)#
Convert text to a list of IDs.
- ids_to_text(ids)#
Convert a list of IDs back to text.
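A hedged round-trip sketch: assuming text_to_ids maps a string to its raw UTF-8 byte values (the usual byte-level convention; the exact ID layout is not specified on this page), encoding and decoding are inverses.

```python
text = "héllo"                      # "é" occupies two UTF-8 bytes
ids = tokenizer.text_to_ids(text)   # assumed: [104, 195, 169, 108, 108, 111]
assert tokenizer.ids_to_text(ids) == text
```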
- tokens_to_ids(tokens)#
Convert a list of tokens to a list of IDs.
- ids_to_tokens(ids)#
Convert a list of IDs to a list of tokens.
- token_to_id(token)#
Convert a token to its corresponding ID.
- id_to_token(id)#
Convert an ID to its corresponding token.
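A sketch of the token-level conversions, assuming tokens round-trip losslessly through their IDs; the concrete token type (raw byte value versus one-byte string) is implementation-defined and not stated here.

```python
tokens = tokenizer.text_to_tokens("abc")
ids = tokenizer.tokens_to_ids(tokens)
assert tokenizer.ids_to_tokens(ids) == tokens

# Single-item lookups mirror the list conversions.
first_id = tokenizer.token_to_id(tokens[0])
assert tokenizer.id_to_token(first_id) == tokens[0]
```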
- abstractmethod add_special_tokens(special_tokens: Union[list, dict]) → None#
Adds special tokens to the tokenizer.
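Because add_special_tokens is abstract on this class, a concrete subclass must supply it. A hypothetical override; the storage attribute and ID layout below are assumptions, not part of this API.

```python
from typing import Union

class ExtendableByteTokenizer(ByteLevelTokenizer):
    def add_special_tokens(self, special_tokens: Union[list, dict]) -> None:
        # _special_token_to_id is an assumed internal attribute; new tokens
        # are placed after the 256 raw-byte slots.
        mapping = getattr(self, "_special_token_to_id", {})
        for token in special_tokens:  # iterating a dict yields its keys
            mapping[token] = 256 + len(mapping)
        self._special_token_to_id = mapping
```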
- property pad_id#
Get the padding ID.
- property bos_id#
Get the beginning-of-sequence ID.
- property eos_id#
Get the end-of-sequence ID.
- property eod#
Get the end-of-document ID.
- property unk_id#
Get the unknown ID.
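With the default construction above, the properties resolve as in this sketch. That eod aliases the EOS ID and that unk_id is None are assumptions: byte-level vocabularies can represent any input, so no unknown token is needed.

```python
print(tokenizer.eos_id)  # 0    (default _eos_id)
print(tokenizer.pad_id)  # 1    (default _pad_id)
print(tokenizer.bos_id)  # None (no BOS token by default)
print(tokenizer.eod)     # assumed to alias eos_id, i.e. 0
print(tokenizer.unk_id)  # assumed None: every byte is representable
```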