core.tokenizers.text.libraries.huggingface_tokenizer#
Module Contents#
Classes#
Wrapper of HuggingFace AutoTokenizer https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.
Data#
API#
- core.tokenizers.text.libraries.huggingface_tokenizer.logger#
‘getLogger(…)’
- class core.tokenizers.text.libraries.huggingface_tokenizer.HuggingFaceTokenizer(
- tokenizer_path: str,
- vocab_file: Optional[str] = None,
- merges_file: Optional[str] = None,
- mask_token: Optional[str] = None,
- bos_token: Optional[str] = None,
- eos_token: Optional[str] = None,
- pad_token: Optional[str] = None,
- sep_token: Optional[str] = None,
- cls_token: Optional[str] = None,
- unk_token: Optional[str] = None,
- additional_special_tokens: Optional[List] = [],
- use_fast: Optional[bool] = False,
- trust_remote_code: Optional[bool] = False,
- include_special_tokens: bool = False,
- chat_template: str = None,
Bases:
core.tokenizers.text.libraries.abstract_tokenizer.MegatronTokenizerTextAbstract

Wrapper of HuggingFace AutoTokenizer https://huggingface.co/transformers/model_doc/auto.html#autotokenizer.
Initialization
- Parameters:
tokenizer_path – corresponds to HuggingFace AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument. For more details, see https://huggingface.co/transformers/_modules/transformers/tokenization_auto.html#AutoTokenizer.from_pretrained.
vocab_file – path to a vocabulary file whose entries are separated by newlines.
merges_file – path to the merges file used by BPE-based tokenizers (e.g. GPT-2)
mask_token – mask token
bos_token – the beginning of sequence token
eos_token – the end of sequence token. Usually equal to sep_token
pad_token – token to use for padding
sep_token – token used for separating sequences
cls_token – class token. Usually equal to bos_token
unk_token – token to use for unknown tokens
additional_special_tokens – list of tokens beyond the standard special tokens (bos, eos, pad, etc.), for example the sentinel tokens of T5 (<extra_id_0>, <extra_id_1>, etc.)
use_fast – whether to use the fast (Rust-based) HuggingFace tokenizer
trust_remote_code – whether to allow loading custom tokenizer code from the model repository
include_special_tokens – when True, converting text to ids includes special tokens / prompt tokens (if any), i.e. it returns self.tokenizer(text).input_ids
chat_template – optional chat template string used by apply_chat_template
- add_special_tokens(special_tokens_dict: dict) int#
Adds a dictionary of special tokens (eos, pad, cls…). If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).
- Parameters:
special_tokens_dict – dict of strings. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Tokens are only added if they are not already in the vocabulary.
- Returns:
Number of tokens added to the vocabulary.
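The indexing behavior described above can be illustrated with a small stand-in. This is a sketch with a toy dict vocabulary, not the actual implementation (which delegates to the underlying HuggingFace tokenizer): missing special tokens are appended after the last existing index, and the count of newly added tokens is returned.

```python
def add_special_tokens_sketch(vocab: dict, special_tokens_dict: dict) -> int:
    """Add special tokens to a toy vocab; return how many were added."""
    added = 0
    for tokens in special_tokens_dict.values():
        # additional_special_tokens maps to a list; the other keys map to a single string
        if isinstance(tokens, str):
            tokens = [tokens]
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)  # indexing continues from the current vocab size
                added += 1
    return added

vocab = {"hello": 0, "world": 1}
n = add_special_tokens_sketch(
    vocab, {"pad_token": "<pad>", "additional_special_tokens": ["<extra_id_0>"]}
)
# n == 2; vocab now maps "<pad>" -> 2 and "<extra_id_0>" -> 3
```

Calling it again with tokens that are already present adds nothing and returns 0, matching the "only added if not already in the vocabulary" rule.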
- property additional_special_tokens_ids#
Returns a list of the ids of the additional special tokens (excluding bos, eos, pad, unk). Used to return, e.g., the sentinel tokens of T5.
- text_to_tokens(text: str) List[str]#
Converts text to tokens.
- tokens_to_text(tokens: List[str]) str#
Converts a list of tokens to text.
- token_to_id(token: str) int#
Converts a single token to its id.
- tokens_to_ids(tokens: List[str]) List[int]#
Converts a list of tokens to their ids.
- ids_to_tokens(ids: List[int]) List[str]#
Converts a list of token ids to their token values.
- text_to_ids(text: str) List[int]#
Converts text to token ids.
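The conversion helpers above compose naturally. The toy whitespace tokenizer below is purely illustrative (the real class delegates to the underlying HuggingFace tokenizer); it only shows the relationship text_to_ids(text) == tokens_to_ids(text_to_tokens(text)) and how out-of-vocabulary tokens fall back to the unknown token.

```python
# Toy vocabulary; real vocabularies come from the HuggingFace tokenizer.
VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
INV_VOCAB = {idx: tok for tok, idx in VOCAB.items()}

def text_to_tokens(text):
    return text.split()  # toy: whitespace split stands in for real tokenization

def tokens_to_ids(tokens):
    return [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]  # unknown -> <unk> id

def ids_to_tokens(ids):
    return [INV_VOCAB[i] for i in ids]

def text_to_ids(text):
    return tokens_to_ids(text_to_tokens(text))

ids = text_to_ids("hello world moon")
# "moon" is out of vocabulary, so it maps to the <unk> id: [1, 2, 0]
```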
- ids_to_text(
- ids: List[int],
- remove_special_tokens: bool = True,
Converts a list of ids to text, optionally removing special tokens.
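A sketch of what the remove_special_tokens flag does: special-token ids are dropped before the remaining tokens are joined back into text. The toy vocabulary and the set of special ids here are illustrative, not the real tokenizer's.

```python
SPECIAL_IDS = {0, 3}  # e.g. <pad> and <eos> in this toy setup
INV = {0: "<pad>", 1: "hello", 2: "world", 3: "<eos>"}

def ids_to_text_sketch(ids, remove_special_tokens=True):
    if remove_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]  # drop pad/eos/etc.
    return " ".join(INV[i] for i in ids)

ids_to_text_sketch([1, 2, 3, 0, 0])                          # -> "hello world"
ids_to_text_sketch([1, 2, 3], remove_special_tokens=False)   # -> "hello world <eos>"
```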
- apply_chat_template(conversation, chat_template, **kwargs)#
Applies the chat template to a conversation and tokenizes the result.
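To show what a chat template does conceptually, the sketch below renders a list of role/content messages into a single prompt string. The real method delegates to the HuggingFace tokenizer's (Jinja-based) chat template; the <|role|> tag format used here is purely illustrative.

```python
def apply_chat_template_sketch(conversation):
    """Render a role/content message list into one prompt string (toy format)."""
    parts = []
    for msg in conversation:
        parts.append(f"<|{msg['role']}|>{msg['content']}")
    parts.append("<|assistant|>")  # trailing tag prompts the model's reply
    return "".join(parts)

prompt = apply_chat_template_sketch([{"role": "user", "content": "Hi"}])
# prompt == "<|user|>Hi<|assistant|>"
```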
- property vocab: list#
Returns tokenizer vocab values.
- property inv_vocab: dict#
Returns tokenizer vocab with reversed keys and values.
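In other words, inv_vocab turns the token-to-id mapping into an id-to-token mapping. A sketch with a toy vocabulary (the real property reads the HuggingFace tokenizer's vocab):

```python
vocab = {"<pad>": 0, "hello": 1, "world": 2}   # token -> id
inv_vocab = {idx: tok for tok, idx in vocab.items()}  # id -> token
# inv_vocab[1] == "hello"; the two mappings have equal size since ids are unique
```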
- property vocab_size: int#
Returns size of tokenizer vocabulary.
- property pad_id: int#
Returns id of padding token.
- property bos_id: int#
Returns id of beginning of sequence token.
- property eos_id: int#
Returns id of end of sequence token.
- property eod: int#
Returns EOD token id.
- property sep_id: int#
Returns id of SEP token.
- property cls_id: int#
Returns id of classification token.
- property unk_id: int#
Returns id of the unknown token.
- property mask_id: int#
Returns id of mask token.
- save_vocabulary(save_directory: str, filename_prefix: str = None)#
Saves the tokenizer’s vocabulary and other artifacts to the specified directory.
- save_pretrained(save_directory: str)#
Saves the tokenizer’s vocabulary and other artifacts to the specified directory.