bridge.training.tokenizers.bert_tokenization#
Tokenization classes.
Module Contents#
Classes#
FullTokenizer | Runs end-to-end tokenization.
BasicTokenizer | Runs basic tokenization (punctuation splitting, lower casing, etc.).
WordpieceTokenizer | Runs WordPiece tokenization.
Functions#
validate_case_matches_checkpoint | Checks whether the casing config is consistent with the checkpoint name.
convert_to_unicode | Converts text to Unicode (if it's not already), assuming utf-8 input.
printable_text | Returns text encoded in a way suitable for print or tf.logging.
load_vocab | Loads a vocabulary file into a dictionary.
convert_by_vocab | Converts a sequence of [tokens|ids] using the vocab.
convert_tokens_to_ids | Converts a sequence of tokens to ids using the vocab.
convert_ids_to_tokens | Converts a sequence of ids to tokens using the inv_vocab.
whitespace_tokenize | Runs basic whitespace cleaning and splitting on a piece of text.
_is_whitespace | Checks whether char is a whitespace character.
_is_control | Checks whether char is a control character.
_is_punctuation | Checks whether char is a punctuation character.
API#
- bridge.training.tokenizers.bert_tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)#
Checks whether the casing config is consistent with the checkpoint name.
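A minimal usage sketch, assuming the function flags an inconsistent pairing of casing flag and model name; the checkpoint path below is hypothetical.

    from bridge.training.tokenizers.bert_tokenization import validate_case_matches_checkpoint

    # Hypothetical checkpoint path: an uncased model name paired with
    # do_lower_case=True is expected to be a consistent combination.
    validate_case_matches_checkpoint(
        do_lower_case=True,
        init_checkpoint="uncased_L-12_H-768_A-12/bert_model.ckpt",
    )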
- bridge.training.tokenizers.bert_tokenization.convert_to_unicode(text)#
Converts text to Unicode (if it's not already), assuming utf-8 input.
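For illustration (the byte string is just example data), both str and utf-8 encoded bytes should come back as unicode text:

    from bridge.training.tokenizers.bert_tokenization import convert_to_unicode

    convert_to_unicode("déjà vu")                    # already unicode, expected unchanged
    convert_to_unicode("déjà vu".encode("utf-8"))    # utf-8 bytes, expected decoded to "déjà vu"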
- bridge.training.tokenizers.bert_tokenization.printable_text(text)#
Returns text encoded in a way suitable for print or tf.logging.
- bridge.training.tokenizers.bert_tokenization.load_vocab(vocab_file)#
Loads a vocabulary file into a dictionary.
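A small sketch; the file name is made up, and the assumption (not stated above) is that the file lists one token per line with ids assigned by line order:

    from bridge.training.tokenizers.bert_tokenization import load_vocab

    # Write a toy vocabulary file, one token per line.
    with open("toy_vocab.txt", "w", encoding="utf-8") as f:
        f.write("[UNK]\n[CLS]\n[SEP]\nhello\nworld\n")

    vocab = load_vocab("toy_vocab.txt")
    # Assumed result: a dict-like mapping such as vocab["hello"] == 3.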
- bridge.training.tokenizers.bert_tokenization.convert_by_vocab(vocab, items)#
Converts a sequence of [tokens|ids] using the vocab.
- bridge.training.tokenizers.bert_tokenization.convert_tokens_to_ids(vocab, tokens)#
Converts a sequence of tokens to ids using the vocab.
- bridge.training.tokenizers.bert_tokenization.convert_ids_to_tokens(inv_vocab, ids)#
Converts a sequence of ids to tokens using the inv_vocab.
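A round-trip sketch for convert_tokens_to_ids and convert_ids_to_tokens; vocab and inv_vocab are toy dictionaries, not a real BERT vocabulary:

    from bridge.training.tokenizers.bert_tokenization import (
        convert_ids_to_tokens,
        convert_tokens_to_ids,
    )

    vocab = {"hello": 0, "world": 1}
    inv_vocab = {i: t for t, i in vocab.items()}

    ids = convert_tokens_to_ids(vocab, ["hello", "world"])   # expected [0, 1]
    tokens = convert_ids_to_tokens(inv_vocab, ids)           # expected ["hello", "world"]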
- bridge.training.tokenizers.bert_tokenization.whitespace_tokenize(text)#
Runs basic whitespace cleaning and splitting on a piece of text.
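For example (expected output inferred from the description):

    from bridge.training.tokenizers.bert_tokenization import whitespace_tokenize

    whitespace_tokenize("  hello   world \n")   # expected ["hello", "world"]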
- class bridge.training.tokenizers.bert_tokenization.FullTokenizer(vocab_file, do_lower_case=True)#
Bases: object
Runs end-to-end tokenization.
Initialization
- tokenize(text)#
- convert_tokens_to_ids(tokens)#
- convert_ids_to_tokens(ids)#
- static convert_tokens_to_string(tokens, clean_up_tokenization_spaces=True)#
Converts a sequence of tokens (string) into a single string.
- vocab_size()#
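An end-to-end sketch of the class above; "vocab.txt" stands in for a real WordPiece vocabulary file and is an assumption of the example:

    from bridge.training.tokenizers.bert_tokenization import FullTokenizer

    tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

    tokens = tokenizer.tokenize("Hello, world!")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    back = tokenizer.convert_ids_to_tokens(ids)

    # Static helper: join word pieces back into a single string.
    text = FullTokenizer.convert_tokens_to_string(back)

    print(tokenizer.vocab_size(), text)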
- class bridge.training.tokenizers.bert_tokenization.BasicTokenizer(do_lower_case=True)#
Bases: object
Runs basic tokenization (punctuation splitting, lower casing, etc.).
Initialization
Constructs a BasicTokenizer.
- Parameters:
do_lower_case – Whether to lower case the input.
- tokenize(text)#
Tokenizes a piece of text.
- _run_strip_accents(text)#
Strips accents from a piece of text.
- _run_split_on_punc(text)#
Splits punctuation on a piece of text.
- _tokenize_chinese_chars(text)#
Adds whitespace around any CJK character.
- _is_chinese_char(cp)#
Checks whether CP is the codepoint of a CJK character.
- _clean_text(text)#
Performs invalid character removal and whitespace cleanup on text.
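A short sketch of the basic tokenizer; the expected output is inferred from the described lower casing and punctuation splitting:

    from bridge.training.tokenizers.bert_tokenization import BasicTokenizer

    basic = BasicTokenizer(do_lower_case=True)
    basic.tokenize("Hello, World!")   # expected ["hello", ",", "world", "!"]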
- class bridge.training.tokenizers.bert_tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)#
Bases: object
Runs WordPiece tokenization.
Initialization
- tokenize(text)#
Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
For example: input = "unaffable", output = ["un", "##aff", "##able"].
- Parameters:
text – A single token or whitespace separated tokens. This should have already been passed through BasicTokenizer.
- Returns:
A list of wordpiece tokens.
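The documented example wired into a runnable sketch; the toy vocabulary (including the unk token) is an assumption chosen so the greedy longest-match-first lookup can succeed:

    from bridge.training.tokenizers.bert_tokenization import WordpieceTokenizer

    # Toy vocabulary containing just the pieces needed for the example.
    vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}
    wp = WordpieceTokenizer(vocab=vocab)

    wp.tokenize("unaffable")   # ["un", "##aff", "##able"]
    wp.tokenize("zzz")         # presumably ["[UNK]"], since no piece matches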
- bridge.training.tokenizers.bert_tokenization._is_whitespace(char)#
Checks whether char is a whitespace character.
- bridge.training.tokenizers.bert_tokenization._is_control(char)#
Checks whether char is a control character.
- bridge.training.tokenizers.bert_tokenization._is_punctuation(char)#
Checks whether char is a punctuation character.
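Illustrative calls for the three private helpers above; they are internal, so these checks are for understanding only and the expected values are assumptions based on the usual Unicode categories:

    from bridge.training.tokenizers.bert_tokenization import (
        _is_control,
        _is_punctuation,
        _is_whitespace,
    )

    _is_whitespace(" ")     # expected True
    _is_whitespace("a")     # expected False
    _is_control("\x00")     # expected True (NUL is a Cc control character)
    _is_punctuation(",")    # expected True
    _is_punctuation("a")    # expected False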