bridge.training.tokenizers.gpt2_tokenization#
Tokenization classes for OpenAI GPT-2.
Module Contents#
Classes#
| Class | Description |
| --- | --- |
| GPT2Tokenizer | GPT-2 BPE tokenizer (byte-level BPE). |
Functions#
| Function | Description |
| --- | --- |
| bytes_to_unicode | Returns a list of utf-8 bytes and a corresponding list of unicode strings. |
| get_pairs | Return the set of symbol pairs in a word. |
Data#
API#
- bridge.training.tokenizers.gpt2_tokenization.logger#
‘getLogger(…)’
- bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_VOCAB_ARCHIVE_MAP#
None
- bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_MERGES_ARCHIVE_MAP#
None
- bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP#
None
- bridge.training.tokenizers.gpt2_tokenization.VOCAB_NAME#
‘vocab.json’
- bridge.training.tokenizers.gpt2_tokenization.MERGES_NAME#
‘merges.txt’
- bridge.training.tokenizers.gpt2_tokenization.SPECIAL_TOKENS_NAME#
‘special_tokens.txt’
- bridge.training.tokenizers.gpt2_tokenization.bytes_to_unicode()#
Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible BPE codes work on unicode strings, which means you need a large number of unicode characters in your vocab if you want to avoid UNKs. At something like a 10B-token dataset you end up needing around 5K characters for decent coverage, a significant percentage of a normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings, and we avoid mapping to whitespace/control characters that the BPE code barfs on.
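A minimal usage sketch, assuming (as in the reference GPT-2 code) that the function returns a dict from byte values to printable unicode characters:

```python
from bridge.training.tokenizers.gpt2_tokenization import bytes_to_unicode

# Byte value -> printable unicode character; the inverse map recovers raw bytes on decode.
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

assert len(byte_encoder) == 256       # every possible byte gets a stand-in character
print(byte_encoder[ord(" ")])         # whitespace is remapped to a visible character ('Ġ' in GPT-2's mapping)
```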
- bridge.training.tokenizers.gpt2_tokenization.get_pairs(word)#
Return the set of symbol pairs in a word.
A word is represented as a tuple of symbols (symbols being variable-length strings).
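For illustration, a sketch assuming the conventional behavior of this helper in GPT-2 BPE code (adjacent symbol pairs of the input tuple):

```python
from bridge.training.tokenizers.gpt2_tokenization import get_pairs

print(get_pairs(("l", "o", "w", "er")))
# {('l', 'o'), ('o', 'w'), ('w', 'er')}  (returned as an unordered set)
```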
- class bridge.training.tokenizers.gpt2_tokenization.GPT2Tokenizer(vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None)#
Bases: object
GPT-2 BPE tokenizer. Peculiarities: byte-level BPE.
Initialization
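A minimal construction sketch; the file paths below are placeholders for a GPT-2 vocab.json / merges.txt pair, and the comments describe the keyword meanings as in the reference implementation:

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer(
    vocab_file="path/to/vocab.json",   # placeholder path to a GPT-2 vocabulary file
    merges_file="path/to/merges.txt",  # placeholder path to the matching BPE merges file
    errors="replace",                  # how undecodable bytes are handled when decoding
    special_tokens=None,               # optional extra tokens appended to the vocabulary
    max_len=None,                      # optional sequence-length cap
)
print(len(tokenizer))                  # vocabulary size, including any special tokens
```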
- classmethod from_pretrained(pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs)#
Instantiate a GPT2Tokenizer from a pre-trained model file. Download and cache the pre-trained model file if needed.
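A usage sketch; 'gpt2' is assumed to be a shortcut name registered in PRETRAINED_VOCAB_ARCHIVE_MAP, and a local directory containing vocab.json and merges.txt should work as well:

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

# Download (or load from cache) by shortcut name ...
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="/tmp/gpt2_cache")

# ... or point at a local directory holding vocab.json / merges.txt.
tokenizer = GPT2Tokenizer.from_pretrained("path/to/tokenizer_dir")
```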
- __len__()#
- set_special_tokens(special_tokens)#
Add a list of additional tokens to the encoder. The additional tokens are indexed starting from the last index of the current vocabulary, in the order of the special_tokens list.
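For example (a sketch; the <bos>/<eos>/<pad> strings are arbitrary placeholders):

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder shortcut; see from_pretrained above

# Ids for the new tokens start right after the existing vocabulary, in list order.
tokenizer.set_special_tokens(["<bos>", "<eos>", "<pad>"])
print(len(tokenizer))                               # grows by the number of special tokens added
```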
- bpe(token)#
- tokenize(text)#
Tokenize a string.
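A sketch of the expected output with the standard GPT-2 vocabulary (the shortcut name is a placeholder):

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder shortcut; see from_pretrained above
print(tokenizer.tokenize("Hello world"))
# ['Hello', 'Ġworld']  -- leading spaces are folded into the byte-level symbols
```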
- convert_tokens_to_ids(tokens)#
Converts a sequence of tokens into ids using the vocab.
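For example (the ids shown are what the standard GPT-2 vocab would give; treat them as illustrative):

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder shortcut; see from_pretrained above
ids = tokenizer.convert_tokens_to_ids(["Hello", "Ġworld"])
print(ids)                                          # e.g. [15496, 995] with the standard GPT-2 vocab
```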
- convert_ids_to_tokens(ids, skip_special_tokens=False)#
Converts a sequence of ids into BPE tokens using the vocab.
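The inverse of the previous call, sketched under the same assumptions (example ids from the standard GPT-2 vocab; skip_special_tokens is assumed to drop any tokens added via set_special_tokens):

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder shortcut; see from_pretrained above
tokens = tokenizer.convert_ids_to_tokens([15496, 995], skip_special_tokens=True)
print(tokens)                                       # e.g. ['Hello', 'Ġworld']
```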
- encode(text)#
- decode(tokens)#
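encode and decode are undocumented above; in the reference GPT-2 tokenizer, encode composes tokenize and convert_tokens_to_ids, while decode maps ids back to text through the byte-level tables (using the errors policy from the constructor). Assuming the same behavior here, a round trip looks like:

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder shortcut; see from_pretrained above
ids = tokenizer.encode("Hello world")               # text -> BPE token ids
text = tokenizer.decode(ids)                        # ids -> text
print(ids, repr(text))
```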
- save_vocabulary(vocab_path)#
Save the tokenizer vocabulary and merge files to a directory.
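A usage sketch; 'exported_tokenizer' is a placeholder directory, and the assumption (as in the reference implementation) is that the method returns the three written file paths, named per VOCAB_NAME, MERGES_NAME and SPECIAL_TOKENS_NAME above:

```python
import os

from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder shortcut; see from_pretrained above
out_dir = "exported_tokenizer"                      # placeholder directory
os.makedirs(out_dir, exist_ok=True)

vocab_file, merges_file, special_tokens_file = tokenizer.save_vocabulary(out_dir)
# The written files can be fed straight back into the constructor:
reloaded = GPT2Tokenizer(vocab_file, merges_file)
```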