core.tokenizers.text.text_tokenizer#
Module Contents#
Classes#
MegatronTokenizerText – Base class for Megatron text tokenizers.
Data#
TOKENIZER_MAPPING_LIBRARIES
API#
- core.tokenizers.text.text_tokenizer.TOKENIZER_MAPPING_LIBRARIES#
‘OrderedDict(…)’
- class core.tokenizers.text.text_tokenizer.MegatronTokenizerText(path: str, config: dict, **kwargs)#
Bases: megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase
Base class for Megatron text tokenizers.
Initialization
- Parameters:
path (str) – path to the tokenizer model.
config (dict) – tokenizer parameters:
- library (str): tokenizer library.
- class_name (str): name of tokenizer class.
- class_path (str): path to tokenizer class.
- model_type (str): type of the model to be used with tokenizer.
- chat_template (str): tokenizer chat template.
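The sketch below shows how these constructor arguments fit together. It is illustrative only: the path, the library value, and the class name are hypothetical placeholders, not values guaranteed to ship with Megatron Core.

```python
from megatron.core.tokenizers.text.text_tokenizer import MegatronTokenizerText

# All values below are hypothetical; consult TOKENIZER_MAPPING_LIBRARIES for
# the tokenizer libraries actually supported by your installation.
tokenizer = MegatronTokenizerText(
    path="/models/tokenizer.model",      # path to the tokenizer model (placeholder)
    config={
        "library": "huggingface",        # tokenizer library (assumed key value)
        "class_name": "AutoTokenizer",   # name of tokenizer class (hypothetical)
        "model_type": "gpt",             # model type used with the tokenizer (hypothetical)
        "chat_template": None,           # optional chat template string
    },
)
```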
- _restore_model(**kwargs)#
Returns tokenizer library object.
- tokenize(text: str) List[int]#
Text tokenization.
- Parameters:
text (str) – text to be tokenized.
- Returns:
list of ids.
- Return type:
list
- detokenize(ids: List[int]) str#
Text detokenization.
- Parameters:
ids (list) – token ids to be detokenized.
- Returns:
detokenized text.
- Return type:
str
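As a quick illustration of the two methods above, here is a round-trip sketch; it assumes a `tokenizer` instance constructed as in the earlier example.

```python
# tokenize() maps text to a list of token ids; detokenize() maps ids back to text.
ids = tokenizer.tokenize("Hello, Megatron!")   # -> List[int]
text = tokenizer.detokenize(ids)               # -> str
assert isinstance(ids, list)
assert isinstance(text, str)
```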
- apply_chat_template(conversation: List[Dict[str, str]], chat_template: Optional[str] = None, **kwargs) Union[str, list]#
Applies chat template to the conversation.
- Parameters:
conversation (list) – conversation to apply the chat template to.
chat_template (Optional[str]) – chat template to be used. If not specified, tokenizer’s chat template will be used.
- Returns:
a chat with applied chat template or a list of token ids.
- Return type:
Union[str, list]
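A usage sketch for apply_chat_template. The role/content message layout below follows the common Hugging Face convention and is an assumption, not something this page specifies.

```python
conversation = [
    {"role": "user", "content": "What does a tokenizer do?"},
    {"role": "assistant", "content": "It maps text to token ids."},
]

# With chat_template omitted, the tokenizer's own template is applied.
# The result is either a formatted string or a list of token ids.
prompt = tokenizer.apply_chat_template(conversation)
```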
- save_pretrained(path: str) None#
Saves HF tokenizer files.
- Parameters:
path (str) – path where to save tokenizer files.
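A one-line usage sketch; the target directory is a hypothetical example.

```python
# Writes the underlying Hugging Face tokenizer files into the given directory.
tokenizer.save_pretrained("/tmp/my_tokenizer")
```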
- add_special_tokens(special_tokens: Union[list, dict]) None#
Adds a dictionary of special tokens (eos, pad, cls, …). Tokens are only added if they are not already in the vocabulary; new tokens are indexed starting from the last index of the current vocabulary.
- Parameters:
special_tokens (Union[list, dict]) – dict of strings. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
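A short sketch of adding special tokens; the token strings here are hypothetical.

```python
# Keys must come from the predefined attribute list above. Tokens already in
# the vocabulary are skipped; new tokens receive ids after the current last index.
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "additional_special_tokens": ["<extra_0>", "<extra_1>"],
})
```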
- property additional_special_tokens_ids: list#
Returns a list of the additional special tokens.
- property vocab_size: int#
Returns vocabulary size.
- property vocab#
Returns tokenizer vocabulary.
- property unique_identifiers: collections.OrderedDict#
Returns a dictionary of unique identifiers.
- property pad: int#
Returns id of padding token.
- property pad_id: int#
Returns id of padding token. Needed for NeMo.
- property eod: int#
Returns id of end of document token.
- property bos: int#
Returns id of beginning of sentence token.
- property bos_id: int#
Returns id of beginning of sentence token. Needed for NeMo.
- property eos_id: int#
Returns id of end of sentence token.
- property eos: int#
Returns id of end of sentence token. Needed for legacy compatibility.
- property unk: int#
Returns id of unknown token.
- property unk_id: int#
Returns id of unknown token. Needed for NeMo.
- property mask: int#
Returns id of mask token.
- property mask_id: int#
Returns id of mask token. Needed for NeMo.
- property cls: int#
Returns id of classification token.
- property cls_id: int#
Returns id of classification token. Needed for NeMo.
- property sep: int#
Returns id of SEP token.
- property sep_id: int#
Returns id of SEP token. Needed for NeMo.
- property vocab_file: str#
Returns vocabulary file path if specified.
- property merges_file: str#
Returns merges file path if specified.
- property inv_vocab: dict#
Returns tokenizer vocab with reversed keys and values.
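To round off the property list, a small sketch reading a few of them; it reuses the `tokenizer` instance from the earlier examples.

```python
print(tokenizer.vocab_size)        # int: number of entries in the vocabulary
print(tokenizer.eos_id)            # int: end of sentence token id
print(tokenizer.pad)               # int: padding token id
id_to_token = tokenizer.inv_vocab  # dict: vocab with keys and values reversed
```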