core.tokenizers.vision.vision_tokenizer#
Module Contents#
Classes#
MegatronTokenizerVision – Base class for Megatron vision tokenizers.
Data#
API#
- core.tokenizers.vision.vision_tokenizer.TOKENIZER_MAPPING_LIBRARIES#
'OrderedDict(…)'
- class core.tokenizers.vision.vision_tokenizer.MegatronTokenizerVision(path: str, config: dict, **kwargs)#
Bases: megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase

Base class for Megatron vision tokenizers.
Initialization
- Parameters:
path (str) – path to the tokenizer model.
config (dict) – tokenizer parameters:
- library (str): tokenizer library.
- class_name (str): name of the tokenizer class.
- class_path (str): path to the tokenizer class.
- model_type (str): type of the model to be used with the tokenizer.
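The shape of the `config` dict can be sketched from the parameter description above. All values below are placeholders for illustration, not defaults of the real library:

```python
# Hypothetical config matching the documented keys; every value here is
# a placeholder, not something the library prescribes.
config = {
    "library": "huggingface",                    # tokenizer library
    "class_name": "AutoTokenizer",               # name of the tokenizer class
    "class_path": "transformers.AutoTokenizer",  # path to the tokenizer class
    "model_type": "vision",                      # model type used with the tokenizer
}
```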
- _restore_model(**kwargs)#
Returns tokenizer library object.
- tokenize(text: Union[str, List[Dict]])#
Text tokenization.
- Parameters:
text (str | list) – text to be tokenized.
- Returns:
list of ids.
- Return type:
list
- detokenize(ids: List[int]) → str#
Text detokenization.
- Parameters:
ids (list) – token IDs to be detokenized.
- Returns:
detokenized text.
- Return type:
str
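`tokenize` and `detokenize` are expected to roughly invert each other. A minimal self-contained stand-in (not the real `MegatronTokenizerVision`, which wraps an external tokenizer library and a model on disk) illustrating that contract:

```python
from typing import List


class ToyVisionTokenizer:
    """Toy stand-in for illustration only; it mimics the
    tokenize/detokenize interface documented above."""

    def __init__(self) -> None:
        self._vocab: dict = {}
        self._inv: dict = {}

    def tokenize(self, text: str) -> List[int]:
        # Assign each new whitespace-separated token the next free id.
        ids = []
        for tok in text.split():
            if tok not in self._vocab:
                idx = len(self._vocab)
                self._vocab[tok] = idx
                self._inv[idx] = tok
            ids.append(self._vocab[tok])
        return ids

    def detokenize(self, ids: List[int]) -> str:
        # Inverse mapping: ids back to tokens, joined by spaces.
        return " ".join(self._inv[i] for i in ids)


tok = ToyVisionTokenizer()
ids = tok.tokenize("a photo of a cat")  # repeated "a" reuses the same id
```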
- tokenize_conversation(conversation: List[Dict], return_target: bool, add_generation_prompt: bool)#
Convert a conversation to tokens.
- Parameters:
conversation (List[Dict]) – Sequence of system/user/assistant messages. Must be in the following format: [{"role": "user", "content": "something"}, {"role": "assistant", "content": "something2"}]
return_target (bool) – Return target tokens with system and assistant masked.
add_generation_prompt (bool) – Add assistant prefix to the end.
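The expected message structure can be written out directly. Including a "system" message is an assumption based on the docstring's "system/user/assistant" wording; the parameter description itself only shows user and assistant turns:

```python
# Message list in the documented format; the system turn is an assumption.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "something"},
    {"role": "assistant", "content": "something2"},
]
roles = [m["role"] for m in conversation]
```

With `return_target=True`, the docstring says the system and assistant portions of the target are masked; with `add_generation_prompt=True`, an assistant prefix is appended so the model can start generating.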
- add_special_tokens(special_tokens: Union[list, dict]) → None#
Adds a dictionary of special tokens (eos, pad, cls…). Tokens are only added if they are not already in the vocabulary; new tokens are indexed starting from the last index of the current vocabulary.
- Parameters:
special_tokens (list | dict) – dict of strings. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
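A `special_tokens` dict built from the predefined keys listed above; the token strings themselves are placeholders chosen for this example:

```python
# Hypothetical special_tokens argument; "<pad>", "<img>", "</img>" are
# placeholder strings, not tokens the library defines.
special_tokens = {
    "pad_token": "<pad>",
    "additional_special_tokens": ["<img>", "</img>"],
}
allowed = {
    "bos_token", "eos_token", "unk_token", "sep_token",
    "pad_token", "cls_token", "mask_token", "additional_special_tokens",
}
```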
- convert_tokens_to_ids(tokens: List[str])#
Convert tokens to IDs.
- abstractmethod apply_chat_template()#
Applies tokenizer’s chat template.
- get_special_tokens() list#
Returns a list of the additional special tokens.
- offsets(ids: list[int], text: str) → list[int]#
Calculate offsets.
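The docstring does not state what the offsets represent. A common convention, assumed here, is the character position in `text` at which each token begins; a toy version over pre-split tokens:

```python
def toy_offsets(tokens: list, text: str) -> list:
    # Character position where each token starts in `text`
    # (an assumed interpretation of "offsets"; the real method
    # works on token ids, not strings).
    out, pos = [], 0
    for t in tokens:
        pos = text.index(t, pos)
        out.append(pos)
        pos += len(t)
    return out
```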
- property vocab#
Tokenizer vocab.
- property vocab_size: int#
Returns vocabulary size.
- property pad#
Pad token ID.
- property eod#
End-of-document token ID.