bridge.training.tokenizers.gpt2_tokenization#

Tokenization classes for OpenAI GPT-2.

Module Contents#

Classes#

GPT2Tokenizer

GPT-2 BPE tokenizer. Peculiarities: byte-level BPE.

Functions#

bytes_to_unicode

Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. When you’re at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. This also avoids mapping to whitespace/control characters the BPE code barfs on.

get_pairs

Return set of symbol pairs in a word.

Data#

API#

bridge.training.tokenizers.gpt2_tokenization.logger#

‘getLogger(…)’

bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_VOCAB_ARCHIVE_MAP#

None

bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_MERGES_ARCHIVE_MAP#

None

bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP#

None

bridge.training.tokenizers.gpt2_tokenization.VOCAB_NAME#

‘vocab.json’

bridge.training.tokenizers.gpt2_tokenization.MERGES_NAME#

‘merges.txt’

bridge.training.tokenizers.gpt2_tokenization.SPECIAL_TOKENS_NAME#

‘special_tokens.txt’

bridge.training.tokenizers.gpt2_tokenization.bytes_to_unicode()#

Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. When you’re at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. This also avoids mapping to whitespace/control characters the BPE code barfs on.
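
A minimal usage sketch. It assumes the function returns a dict mapping byte values (0–255) to printable unicode characters, as in the reference GPT-2 BPE implementation:

```python
from bridge.training.tokenizers.gpt2_tokenization import bytes_to_unicode

# Assumption: bytes_to_unicode() returns a dict {byte value -> unicode char},
# as in the reference GPT-2 BPE code.
byte_encoder = bytes_to_unicode()
byte_decoder = {c: b for b, c in byte_encoder.items()}

# Map raw UTF-8 bytes into the unicode alphabet the BPE merges operate on.
text = "héllo"
mapped = "".join(byte_encoder[b] for b in text.encode("utf-8"))

# The mapping is reversible, so the original string is recovered byte-for-byte.
recovered = bytearray(byte_decoder[c] for c in mapped).decode("utf-8")
assert recovered == text
```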

bridge.training.tokenizers.gpt2_tokenization.get_pairs(word)#

Return set of symbol pairs in a word.

Word is represented as tuple of symbols (symbols being variable-length strings).
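
For example, with a word represented as a tuple of single-character symbols:

```python
from bridge.training.tokenizers.gpt2_tokenization import get_pairs

word = ("h", "e", "l", "l", "o")
pairs = get_pairs(word)
# Adjacent symbol pairs: {("h", "e"), ("e", "l"), ("l", "l"), ("l", "o")}
print(pairs)
```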

class bridge.training.tokenizers.gpt2_tokenization.GPT2Tokenizer(
vocab_file,
merges_file,
errors='replace',
special_tokens=None,
max_len=None,
)#

Bases: object

GPT-2 BPE tokenizer. Peculiarities:

- Byte-level BPE

Initialization
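
A minimal construction sketch, assuming a GPT-2 style vocab.json / merges.txt pair is already on disk (the paths below are placeholders):

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer(
    vocab_file="/path/to/vocab.json",    # placeholder path
    merges_file="/path/to/merges.txt",   # placeholder path
    errors="replace",                    # error handling for undecodable bytes on decode
)
print(len(tokenizer))  # vocabulary size
```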

classmethod from_pretrained(
pretrained_model_name_or_path,
cache_dir=None,
*inputs,
**kwargs,
)#

Instantiate a GPT2Tokenizer from a pre-trained model file. Download and cache the pre-trained model file if needed.
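
A hedged loading sketch. It assumes 'gpt2' is one of the keys in PRETRAINED_VOCAB_ARCHIVE_MAP above; a path to a local directory containing vocab.json and merges.txt is the usual alternative:

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

# 'gpt2' is assumed to be a key in PRETRAINED_VOCAB_ARCHIVE_MAP; the files are
# downloaded once and reused from cache_dir on later calls.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="/tmp/gpt2_cache")
```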

__len__()#
set_special_tokens(special_tokens)#

Add a list of additional tokens to the encoder. The additional tokens are indexed starting from the last index of the current vocabulary in the order of the special_tokens list.
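
Continuing the construction sketch above (the token strings are illustrative):

```python
print(len(tokenizer))                             # size before adding special tokens
tokenizer.set_special_tokens(["<SEP>", "<PAD>"])
print(len(tokenizer))                             # special tokens take the next free ids
print(tokenizer.convert_tokens_to_ids(["<SEP>", "<PAD>"]))
```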

bpe(token)#
tokenize(text)#

Tokenize a string.
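
Continuing the sketch; the exact token strings depend on the merges file, and a leading space typically shows up as the "Ġ" marker produced by the byte-to-unicode mapping:

```python
tokens = tokenizer.tokenize("Hello world")
print(tokens)  # e.g. ['Hello', 'Ġworld'] with the standard GPT-2 vocab (illustrative)
```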

convert_tokens_to_ids(tokens)#

Converts a sequence of tokens into ids using the vocab.

convert_ids_to_tokens(ids, skip_special_tokens=False)#

Converts a sequence of ids into BPE tokens using the vocab.
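
A round trip between token strings and ids, continuing the sketch above:

```python
tokens = tokenizer.tokenize("Hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)
back = tokenizer.convert_ids_to_tokens(ids, skip_special_tokens=False)
assert back == tokens
```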

encode(text)#
decode(tokens)#
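
encode and decode are assumed to be the convenience round trip of the steps above: encode tokenizes text and converts the tokens to ids, and decode maps those ids back to a string:

```python
ids = tokenizer.encode("Hello world")
text = tokenizer.decode(ids)  # assumption: decode takes the ids produced by encode
print(ids, text)
```
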
save_vocabulary(vocab_path)#

Save the tokenizer vocabulary and merge files to a directory.
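
A sketch of persisting the tokenizer; the file names written are expected to match VOCAB_NAME, MERGES_NAME, and SPECIAL_TOKENS_NAME above:

```python
import os

save_dir = "/tmp/my_gpt2_tokenizer"  # placeholder directory
os.makedirs(save_dir, exist_ok=True)
# Expected to write vocab.json, merges.txt, and special_tokens.txt into save_dir.
tokenizer.save_vocabulary(save_dir)
```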