core.tokenizers.text.libraries.huggingface_tokenizer#

Module Contents#

Classes#

HuggingFaceTokenizer

Wrapper of HuggingFace AutoTokenizer https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.

Data#

API#

core.tokenizers.text.libraries.huggingface_tokenizer.logger#

‘getLogger(…)’

class core.tokenizers.text.libraries.huggingface_tokenizer.HuggingFaceTokenizer(
tokenizer_path: str,
vocab_file: Optional[str] = None,
merges_file: Optional[str] = None,
mask_token: Optional[str] = None,
bos_token: Optional[str] = None,
eos_token: Optional[str] = None,
pad_token: Optional[str] = None,
sep_token: Optional[str] = None,
cls_token: Optional[str] = None,
unk_token: Optional[str] = None,
additional_special_tokens: Optional[List] = [],
use_fast: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
include_special_tokens: bool = False,
chat_template: str = None,
)#

Bases: core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract

Wrapper of HuggingFace AutoTokenizer https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.

Initialization

Parameters:
  • tokenizer_path – corresponds to HuggingFace-AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument. For more details please refer to https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained.

  • vocab_file – path to file with vocabulary which consists of characters separated by newlines.

  • mask_token – mask token

  • bos_token – the beginning of sequence token

  • eos_token – the end of sequence token. Usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token. Usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • additional_special_tokens – list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

  • use_fast – whether to use fast HuggingFace tokenizer

  • include_special_tokens – when True, converting text to ids includes special tokens / prompt tokens (if any), i.e. it yields self.tokenizer(text).input_ids
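As a rough illustration of the include_special_tokens flag, here is a toy stand-in (hypothetical names; the real class delegates to a HuggingFace AutoTokenizer):

```python
# Toy sketch of the include_special_tokens behavior (illustrative only;
# the real wrapper delegates to a HuggingFace AutoTokenizer).
class ToyTokenizer:
    def __init__(self, include_special_tokens: bool = False):
        self.include_special_tokens = include_special_tokens
        self.vocab = {"<bos>": 0, "<eos>": 1, "hello": 2, "world": 3}

    def text_to_ids(self, text: str) -> list:
        ids = [self.vocab[tok] for tok in text.split()]
        if self.include_special_tokens:
            # Mirrors self.tokenizer(text).input_ids, which prepends/appends
            # the model's special tokens around the content tokens.
            ids = [self.vocab["<bos>"]] + ids + [self.vocab["<eos>"]]
        return ids

plain = ToyTokenizer().text_to_ids("hello world")
special = ToyTokenizer(include_special_tokens=True).text_to_ids("hello world")
```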

add_special_tokens(special_tokens_dict: dict) int#

Adds a dictionary of special tokens (eos, pad, cls…). If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).

Parameters:

special_tokens_dict – dict of string. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Tokens are only added if they are not already in the vocabulary.

Returns:

Number of tokens added to the vocabulary.
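A minimal sketch of these semantics on a plain dict vocabulary (the real method forwards to HuggingFace; this only models the "append missing tokens after the last index, return the count" contract described above):

```python
# Sketch of add_special_tokens semantics: new tokens are appended at the end
# of the vocabulary; tokens already in the vocabulary are left untouched.
def add_special_tokens(vocab: dict, special_tokens_dict: dict) -> int:
    added = 0
    for value in special_tokens_dict.values():
        # additional_special_tokens may map to a list of tokens.
        tokens = value if isinstance(value, list) else [value]
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)  # indexed from the current last index
                added += 1
    return added

vocab = {"hello": 0, "world": 1}
n = add_special_tokens(vocab, {"pad_token": "<pad>", "eos_token": "<eos>"})
# n == 2; "<pad>" and "<eos>" now sit at indices 2 and 3
```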

property additional_special_tokens_ids#

Returns a list of the additional special tokens (excluding bos, eos, pad, unk). Used to return sentinel tokens for e.g. T5.

text_to_tokens(text: str) List[str]#

Converts text to tokens.

tokens_to_text(tokens: List[str]) str#

Converts a list of tokens to text.

token_to_id(token: str) int#

Converts a single token to its id.

tokens_to_ids(tokens: List[str]) List[int]#

Converts a list of tokens to their ids.

ids_to_tokens(ids: List[int]) List[str]#

Converts a list of token ids to their token values.

text_to_ids(text: str) List[int]#

Converts text to token ids.

ids_to_text(
ids: List[int],
remove_special_tokens: bool = True,
) str#

Converts list of ids to text.
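The conversion methods compose into the usual round trip. A self-contained sketch with a toy whitespace tokenizer and dict vocabulary (illustrative only; the real methods delegate to HuggingFace):

```python
# Toy whitespace tokenizer demonstrating how the conversion methods compose:
# text_to_ids == tokens_to_ids(text_to_tokens(...)), and the reverse for text.
vocab = {"the": 0, "cat": 1, "sat": 2}
inv_vocab = {i: t for t, i in vocab.items()}

def text_to_tokens(text: str) -> list: return text.split()
def tokens_to_text(tokens: list) -> str: return " ".join(tokens)
def tokens_to_ids(tokens: list) -> list: return [vocab[t] for t in tokens]
def ids_to_tokens(ids: list) -> list: return [inv_vocab[i] for i in ids]

def text_to_ids(text: str) -> list:
    return tokens_to_ids(text_to_tokens(text))

def ids_to_text(ids: list) -> str:
    return tokens_to_text(ids_to_tokens(ids))

# Round trip: encoding then decoding recovers the original text.
assert ids_to_text(text_to_ids("the cat sat")) == "the cat sat"
```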

apply_chat_template(conversation, chat_template, **kwargs)#

Applies the chat template to the conversation and tokenizes the result.
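Conceptually, applying a chat template renders the conversation's messages into a single prompt string before tokenization. A simplified stand-in using str.format (the real HuggingFace chat templates are Jinja templates, so this is an assumption-laden sketch, not the actual API):

```python
# Hypothetical sketch: render each {"role": ..., "content": ...} message
# through a per-message template and concatenate into one prompt string.
def apply_chat_template(conversation: list, chat_template: str) -> str:
    return "".join(
        chat_template.format(role=m["role"], content=m["content"])
        for m in conversation
    )

template = "<|{role}|>{content}<|end|>\n"
conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
prompt = apply_chat_template(conversation, template)
```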

property vocab: list#

Returns tokenizer vocab values.

property inv_vocab: dict#

Returns tokenizer vocab with reversed keys and values.

property vocab_size: int#

Returns size of tokenizer vocabulary.
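The relationship between these three properties on a toy dict vocabulary (illustrative sketch, not the wrapper's internals):

```python
# vocab maps token -> id; inv_vocab is the same mapping with keys and
# values reversed (id -> token); vocab_size is the number of entries.
vocab = {"a": 0, "b": 1, "c": 2}
inv_vocab = {v: k for k, v in vocab.items()}
vocab_size = len(vocab)
```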

property pad_id: int#

Returns id of padding token.

property bos_id: int#

Returns id of beginning of sentence token.

property eos_id: int#

Returns id of end of sentence token.

property eod: int#

Returns EOD token id.

property sep_id: int#

Returns id of SEP token.

property cls_id: int#

Returns id of classification token.

property unk_id: int#

Returns id of the unknown token.

property mask_id: int#

Returns id of mask token.

save_vocabulary(save_directory: str, filename_prefix: str = None)#

Saves the tokenizer’s vocabulary and other artifacts to the specified directory.

save_pretrained(save_directory: str)#

Saves the tokenizer’s vocabulary and other artifacts to the specified directory.
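An illustrative stand-in for the vocabulary-saving step (the real methods delegate to HuggingFace's own serialization; the file name and JSON format here are assumptions for the sketch):

```python
import json
import os
import tempfile

# Hypothetical sketch: write the vocabulary to a JSON file inside the
# target directory, honoring an optional filename prefix.
def save_vocabulary(vocab: dict, save_directory: str,
                    filename_prefix: str = None) -> str:
    name = (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
    path = os.path.join(save_directory, name)
    with open(path, "w") as f:
        json.dump(vocab, f)
    return path

with tempfile.TemporaryDirectory() as d:
    path = save_vocabulary({"hello": 0, "world": 1}, d, filename_prefix="toy")
```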