core.tokenizers.vision.vision_tokenizer#

Module Contents#

Classes#

MegatronTokenizerVision

Base class for Megatron vision tokenizers.

Data#

API#

core.tokenizers.vision.vision_tokenizer.TOKENIZER_MAPPING_LIBRARIES#

‘OrderedDict(…)’

class core.tokenizers.vision.vision_tokenizer.MegatronTokenizerVision(path: str, config: dict, **kwargs)#

Bases: megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase

Base class for Megatron vision tokenizers.

Initialization

Parameters:
  • path (str) – path to the tokenizer model.

  • config (dict) – tokenizer parameters:
      • library (str): tokenizer library.
      • class_name (str): name of the tokenizer class.
      • class_path (str): path to the tokenizer class.
      • model_type (str): type of the model to be used with the tokenizer.

_restore_model(**kwargs)#

Returns tokenizer library object.

tokenize(text: Union[str, List[Dict]]) → List[int]#

Text tokenization.

Parameters:

text (str | list) – text to be tokenized.

Returns:

list of ids.

Return type:

list

detokenize(ids: List[int]) → str#

Text detokenization.

Parameters:

ids (list) – token ids to be detokenized.

Returns:

detokenized text.

Return type:

str
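
The tokenize()/detokenize() pair forms a round trip between text and token ids. A minimal sketch of that contract, using a hypothetical whitespace tokenizer (the real class delegates to the underlying tokenizer library object):

```python
# Illustrative only: a toy vocabulary-based tokenizer mirroring the
# tokenize()/detokenize() contract (str -> List[int] -> str).
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                                # token -> id
        self.inv_vocab = {i: t for t, i in vocab.items()} # id -> token

    def tokenize(self, text):
        # Whitespace split; unknown words raise KeyError in this sketch.
        return [self.vocab[tok] for tok in text.split()]

    def detokenize(self, ids):
        return " ".join(self.inv_vocab[i] for i in ids)

tok = ToyTokenizer({"hello": 0, "world": 1})
ids = tok.tokenize("hello world")   # [0, 1]
text = tok.detokenize(ids)          # "hello world"
```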

tokenize_conversation(
conversation: List[Dict],
return_target: bool,
add_generation_prompt: bool,
)#

Convert a conversation to tokens.

Parameters:
  • conversation (List[Dict]) – sequence of system/user/assistant messages. Must be in the following format: [{"role": "user", "content": "something"}, {"role": "assistant", "content": "something2"}]

  • return_target (bool) – Return target tokens with system and assistant masked.

  • add_generation_prompt (bool) – Add assistant prefix to the end.
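
The expected message structure can be checked before calling tokenize_conversation(). This sketch only validates the shape described above; the helper name is illustrative, and the actual token output depends on the tokenizer's chat template:

```python
# Conversation format expected by tokenize_conversation(), per the docstring.
conversation = [
    {"role": "user", "content": "something"},
    {"role": "assistant", "content": "something2"},
]

def validate_conversation(conv):
    """Hypothetical helper: check each message is a dict with a known role
    and a 'content' key."""
    allowed = {"system", "user", "assistant"}
    return all(
        isinstance(m, dict) and m.get("role") in allowed and "content" in m
        for m in conv
    )

validate_conversation(conversation)  # True
```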

add_special_tokens(special_tokens: Union[list, dict]) None#

Adds a dictionary of special tokens (eos, pad, cls…). Tokens are only added if they are not already in the vocabulary. Indexed starting from the last index of the current vocabulary.

Parameters:

special_tokens (list | dict) – list or dict of special token strings. Dict keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
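
The indexing behavior described above can be sketched as follows. This is an illustrative stand-in, not the real implementation: tokens already in the vocabulary are skipped, and new ones receive ids starting from the current vocabulary size:

```python
# Sketch of the documented add_special_tokens() behavior: skip tokens that
# already exist; append new ones at the next free index.
def add_special_tokens(vocab, special_tokens):
    for token in special_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next index after current vocab
    return vocab

vocab = {"hello": 0, "world": 1}
add_special_tokens(vocab, ["<pad>", "hello", "<eos>"])
# vocab -> {"hello": 0, "world": 1, "<pad>": 2, "<eos>": 3}
```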

convert_tokens_to_ids(tokens: List[str])#

Convert tokens to IDs.

abstractmethod apply_chat_template()#

Applies tokenizer’s chat template.

get_special_tokens() → list#

Returns a list of the additional special tokens.

offsets(ids: list[int], text: str) → list[int]#

Calculate offsets.
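
One common meaning of token offsets is the starting character position of each token in the original text. A sketch under that assumption, operating on token strings rather than ids for clarity (the real method's behavior depends on the underlying tokenizer library):

```python
# Illustrative offsets computation: for each token, the character index
# where it starts in the source text, scanning left to right.
def token_offsets(tokens, text):
    positions, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)  # find next occurrence
        positions.append(start)
        cursor = start + len(tok)
    return positions

token_offsets(["hello", "world"], "hello world")  # [0, 6]
```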

property vocab#

Tokenizer vocab.

property vocab_size: int#

Returns vocabulary size.

property pad#

Pad token ID.

property eod#

End of sentence token ID.