bridge.training.tokenizers.gpt2_tokenization#

Tokenization classes for OpenAI GPT-2.

Module Contents#

Classes#

GPT2Tokenizer

GPT-2 BPE tokenizer. Peculiarities: byte-level BPE.

Functions#

bytes_to_unicode

Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. When you’re at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. This also avoids mapping to whitespace/control characters the BPE code barfs on.

get_pairs

Return set of symbol pairs in a word.

Data#

API#

bridge.training.tokenizers.gpt2_tokenization.logger#

‘getLogger(…)’

bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_VOCAB_ARCHIVE_MAP#

None

bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_MERGES_ARCHIVE_MAP#

None

bridge.training.tokenizers.gpt2_tokenization.PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP#

None

bridge.training.tokenizers.gpt2_tokenization.VOCAB_NAME#

‘vocab.json’

bridge.training.tokenizers.gpt2_tokenization.MERGES_NAME#

‘merges.txt’

bridge.training.tokenizers.gpt2_tokenization.SPECIAL_TOKENS_NAME#

‘special_tokens.txt’

bridge.training.tokenizers.gpt2_tokenization.bytes_to_unicode()#

Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. When you’re at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. This also avoids mapping to whitespace/control characters the BPE code barfs on.
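
A minimal usage sketch. It assumes the function returns a dict mapping byte values (0–255) to printable unicode characters, as in the reference GPT-2 BPE implementation:

```python
from bridge.training.tokenizers.gpt2_tokenization import bytes_to_unicode

# Assumption: bytes_to_unicode() returns a dict {byte value -> unicode char},
# as in the reference GPT-2 BPE code.
byte_encoder = bytes_to_unicode()
byte_decoder = {c: b for b, c in byte_encoder.items()}

# Map raw UTF-8 bytes into the unicode alphabet the BPE merges operate on.
text = "héllo"
mapped = "".join(byte_encoder[b] for b in text.encode("utf-8"))

# The mapping is reversible, so the original string is recovered byte-for-byte.
recovered = bytearray(byte_decoder[c] for c in mapped).decode("utf-8")
assert recovered == text
```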

bridge.training.tokenizers.gpt2_tokenization.get_pairs(word)#

Return set of symbol pairs in a word.

Word is represented as tuple of symbols (symbols being variable-length strings).
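
For example, with a word represented as a tuple of single-character symbols:

```python
from bridge.training.tokenizers.gpt2_tokenization import get_pairs

word = ("h", "e", "l", "l", "o")
pairs = get_pairs(word)
# Adjacent symbol pairs: {("h", "e"), ("e", "l"), ("l", "l"), ("l", "o")}
print(pairs)
```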

class bridge.training.tokenizers.gpt2_tokenization.GPT2Tokenizer(
vocab_file,
merges_file,
errors='replace',
special_tokens=None,
max_len=None,
)#

Bases: object

GPT-2 BPE tokenizer. Peculiarities:

- Byte-level BPE

Initialization
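
A minimal construction sketch, assuming a GPT-2 style vocab.json / merges.txt pair is already on disk (the paths below are placeholders):

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

tokenizer = GPT2Tokenizer(
    vocab_file="/path/to/vocab.json",    # placeholder path
    merges_file="/path/to/merges.txt",   # placeholder path
    errors="replace",                    # error handling for undecodable bytes on decode
)
print(len(tokenizer))  # vocabulary size
```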

classmethod from_pretrained(
pretrained_model_name_or_path,
cache_dir=None,
*inputs,
**kwargs,
)#

Instantiate a GPT2Tokenizer from a pre-trained model file. Download and cache the pre-trained model file if needed.
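
A hedged loading sketch. It assumes 'gpt2' is one of the keys in PRETRAINED_VOCAB_ARCHIVE_MAP above; a path to a local directory containing vocab.json and merges.txt is the usual alternative:

```python
from bridge.training.tokenizers.gpt2_tokenization import GPT2Tokenizer

# 'gpt2' is assumed to be a key in PRETRAINED_VOCAB_ARCHIVE_MAP; the files are
# downloaded once and reused from cache_dir on later calls.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="/tmp/gpt2_cache")
```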

__len__()#
set_special_tokens(special_tokens)#

Add a list of additional tokens to the encoder. The additional tokens are indexed starting from the last index of the current vocabulary in the order of the special_tokens list.
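
Continuing the construction sketch above (the token strings are illustrative):

```python
print(len(tokenizer))                             # size before adding special tokens
tokenizer.set_special_tokens(["<SEP>", "<PAD>"])
print(len(tokenizer))                             # special tokens take the next free ids
print(tokenizer.convert_tokens_to_ids(["<SEP>", "<PAD>"]))
```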

bpe(token)#
tokenize(text)#

Tokenize a string.
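
Continuing the sketch; the exact token strings depend on the merges file, and a leading space typically shows up as the "Ġ" marker produced by the byte-to-unicode mapping:

```python
tokens = tokenizer.tokenize("Hello world")
print(tokens)  # e.g. ['Hello', 'Ġworld'] with the standard GPT-2 vocab (illustrative)
```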

convert_tokens_to_ids(tokens)#

Converts a sequence of tokens into ids using the vocab.

convert_ids_to_tokens(ids, skip_special_tokens=False)#

Converts a sequence of ids into BPE tokens using the vocab.
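
A round trip between token strings and ids, continuing the sketch above:

```python
tokens = tokenizer.tokenize("Hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)
back = tokenizer.convert_ids_to_tokens(ids, skip_special_tokens=False)
assert back == tokens
```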

encode(text)#
decode(tokens)#
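
encode and decode are assumed to be the convenience round trip of the steps above: encode tokenizes text and converts the tokens to ids, and decode maps those ids back to a string:

```python
ids = tokenizer.encode("Hello world")
text = tokenizer.decode(ids)  # assumption: decode takes the ids produced by encode
print(ids, text)
```
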
save_vocabulary(vocab_path)#

Save the tokenizer vocabulary and merge files to a directory.
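
A sketch of persisting the tokenizer; the file names written are expected to match VOCAB_NAME, MERGES_NAME, and SPECIAL_TOKENS_NAME above:

```python
import os

save_dir = "/tmp/my_gpt2_tokenizer"  # placeholder directory
os.makedirs(save_dir, exist_ok=True)
# Expected to write vocab.json, merges.txt, and special_tokens.txt into save_dir.
tokenizer.save_vocabulary(save_dir)
```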