bridge.training.tokenizers.bert_tokenization#

Tokenization classes.

Module Contents#

Classes#

FullTokenizer

Runs end-to-end tokenization.

BasicTokenizer

Runs basic tokenization (punctuation splitting, lower casing, etc.).

WordpieceTokenizer

Runs WordPiece tokenization.

Functions#

validate_case_matches_checkpoint

Checks whether the casing config is consistent with the checkpoint name.

convert_to_unicode

Converts text to Unicode (if it’s not already), assuming utf-8 input.

printable_text

Returns text encoded in a way suitable for print or tf.logging.

load_vocab

Loads a vocabulary file into a dictionary.

convert_by_vocab

Converts a sequence of tokens or ids using the vocab.

convert_tokens_to_ids

Converts a sequence of tokens to ids using the vocab.

convert_ids_to_tokens

Converts a sequence of ids to tokens using the inv_vocab.

whitespace_tokenize

Runs basic whitespace cleaning and splitting on a piece of text.

_is_whitespace

Checks whether char is a whitespace character.

_is_control

Checks whether char is a control character.

_is_punctuation

Checks whether char is a punctuation character.

API#

bridge.training.tokenizers.bert_tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)#

Checks whether the casing config is consistent with the checkpoint name.
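
A minimal usage sketch. The checkpoint path below is purely illustrative, and the behaviour described (raising ValueError on an inconsistent combination) is that of the reference BERT implementation this module derives from:

```python
from bridge.training.tokenizers.bert_tokenization import validate_case_matches_checkpoint

# Consistent: an "uncased" checkpoint name with do_lower_case=True passes silently.
validate_case_matches_checkpoint(
    do_lower_case=True,
    init_checkpoint="uncased_L-12_H-768_A-12/bert_model.ckpt",  # illustrative path
)

# An inconsistent combination is expected to raise ValueError.
try:
    validate_case_matches_checkpoint(
        do_lower_case=False,
        init_checkpoint="uncased_L-12_H-768_A-12/bert_model.ckpt",
    )
except ValueError as err:
    print(err)
```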

bridge.training.tokenizers.bert_tokenization.convert_to_unicode(text)#

Converts text to Unicode (if it’s not already), assuming utf-8 input.

bridge.training.tokenizers.bert_tokenization.printable_text(text)#

Returns text encoded in a way suitable for print or tf.logging.
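
A brief sketch of the two text helpers, assuming they behave like the reference BERT implementation (bytes are decoded as utf-8, str passes through unchanged):

```python
from bridge.training.tokenizers.bert_tokenization import convert_to_unicode, printable_text

# utf-8 bytes are decoded to str; str input is returned as-is.
assert convert_to_unicode(b"caf\xc3\xa9") == "café"
assert convert_to_unicode("café") == "café"

# printable_text returns a plain str suitable for print() or logging.
print(printable_text(b"hello"))  # hello
```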

bridge.training.tokenizers.bert_tokenization.load_vocab(vocab_file)#

Loads a vocabulary file into a dictionary.
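
A minimal sketch of load_vocab, assuming the usual BERT vocabulary format of one token per line (ids are assigned in file order):

```python
import tempfile

from bridge.training.tokenizers.bert_tokenization import load_vocab

# Write a tiny vocabulary file: one token per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello\nworld\n")
    vocab_path = f.name

vocab = load_vocab(vocab_path)
print(vocab["hello"])  # 4 -- ids follow line order in the file
```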

bridge.training.tokenizers.bert_tokenization.convert_by_vocab(vocab, items)#

Converts a sequence of tokens or ids using the vocab.

bridge.training.tokenizers.bert_tokenization.convert_tokens_to_ids(vocab, tokens)#

Converts a sequence of tokens to ids using the vocab.

bridge.training.tokenizers.bert_tokenization.convert_ids_to_tokens(inv_vocab, ids)#

Converts a sequence of ids to tokens using the inv_vocab.
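
The three conversion helpers are thin lookups over a mapping. A small in-memory sketch (toy vocabulary, not a real BERT vocabulary):

```python
from bridge.training.tokenizers.bert_tokenization import (
    convert_by_vocab,
    convert_ids_to_tokens,
    convert_tokens_to_ids,
)

vocab = {"[UNK]": 0, "hello": 1, "world": 2}   # token -> id
inv_vocab = {i: t for t, i in vocab.items()}   # id -> token

ids = convert_tokens_to_ids(vocab, ["hello", "world"])
print(ids)                                     # [1, 2]
print(convert_ids_to_tokens(inv_vocab, ids))   # ['hello', 'world']
print(convert_by_vocab(vocab, ["hello"]))      # [1] -- same lookup, generic form
```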

bridge.training.tokenizers.bert_tokenization.whitespace_tokenize(text)#

Runs basic whitespace cleaning and splitting on a piece of text.
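
whitespace_tokenize strips the text and splits on whitespace, for example:

```python
from bridge.training.tokenizers.bert_tokenization import whitespace_tokenize

print(whitespace_tokenize("  hello   world \n"))  # ['hello', 'world']
print(whitespace_tokenize("   "))                 # []
```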

class bridge.training.tokenizers.bert_tokenization.FullTokenizer(vocab_file, do_lower_case=True)#

Bases: object

Runs end-to-end tokenization.

Initialization

tokenize(text)#
convert_tokens_to_ids(tokens)#
convert_ids_to_tokens(ids)#
static convert_tokens_to_string(tokens, clean_up_tokenization_spaces=True)#

Converts a sequence of tokens (strings) into a single string.

vocab_size()#
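
An end-to-end usage sketch for FullTokenizer. The "vocab.txt" path is an assumption (any standard BERT WordPiece vocabulary file), and the exact token split depends on that vocabulary:

```python
from bridge.training.tokenizers.bert_tokenization import FullTokenizer

# "vocab.txt" is a placeholder for a real BERT WordPiece vocabulary file.
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("BERT uses WordPiece tokenization.")
ids = tokenizer.convert_tokens_to_ids(tokens)
back = tokenizer.convert_ids_to_tokens(ids)

print(tokens)                                        # wordpieces, e.g. ['bert', 'uses', ...]
print(ids)                                           # corresponding vocabulary ids
print(FullTokenizer.convert_tokens_to_string(back))  # detokenized string
print(tokenizer.vocab_size())                        # size of the loaded vocabulary
```
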
class bridge.training.tokenizers.bert_tokenization.BasicTokenizer(do_lower_case=True)#

Bases: object

Runs basic tokenization (punctuation splitting, lower casing, etc.).

Initialization

Constructs a BasicTokenizer.

Parameters:

do_lower_case – Whether to lower case the input.

tokenize(text)#

Tokenizes a piece of text.

_run_strip_accents(text)#

Strips accents from a piece of text.

_run_split_on_punc(text)#

Splits punctuation on a piece of text.

_tokenize_chinese_chars(text)#

Adds whitespace around any CJK character.

_is_chinese_char(cp)#

Checks whether cp is the code point of a CJK character.

_clean_text(text)#

Performs invalid character removal and whitespace cleanup on text.
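
A short sketch of BasicTokenizer on its own (no vocabulary needed); with do_lower_case=True it lowercases, strips accents, and splits punctuation:

```python
from bridge.training.tokenizers.bert_tokenization import BasicTokenizer

basic = BasicTokenizer(do_lower_case=True)
print(basic.tokenize("Héllo, World!"))  # expected: ['hello', ',', 'world', '!']
```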

class bridge.training.tokenizers.bert_tokenization.WordpieceTokenizer(
vocab,
unk_token='[UNK]',
max_input_chars_per_word=200,
)#

Bases: object

Runs WordPiece tokenization.

Initialization

tokenize(text)#

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example: input = "unaffable", output = ["un", "##aff", "##able"].

Parameters:

text – A single token or whitespace-separated tokens. This should already have been passed through BasicTokenizer.

Returns:

A list of wordpiece tokens.
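
The greedy longest-match-first behaviour can be seen with a toy vocabulary that mirrors the example above:

```python
from bridge.training.tokenizers.bert_tokenization import WordpieceTokenizer

# Toy vocabulary containing just the pieces needed for the docstring example.
vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}
wp = WordpieceTokenizer(vocab=vocab)

print(wp.tokenize("unaffable"))    # ['un', '##aff', '##able']
print(wp.tokenize("unknownword"))  # ['[UNK]'] -- no full piece sequence in this toy vocab
```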

bridge.training.tokenizers.bert_tokenization._is_whitespace(char)#

Checks whether char is a whitespace character.

bridge.training.tokenizers.bert_tokenization._is_control(char)#

Checks whether char is a control character.

bridge.training.tokenizers.bert_tokenization._is_punctuation(char)#

Checks whether char is a punctuation character.