core.tokenizers.vision.vision_tokenizer#

Module Contents#

Classes#

MegatronTokenizerVision

Base class for Megatron vision tokenizers.

Data#

API#

core.tokenizers.vision.vision_tokenizer.TOKENIZER_MAPPING_LIBRARIES#

‘OrderedDict(…)’

class core.tokenizers.vision.vision_tokenizer.MegatronTokenizerVision(path: str, config: dict, **kwargs)#

Bases: megatron.core.tokenizers.base_tokenizer.MegatronTokenizerBase

Base class for Megatron vision tokenizers.

Initialization

Parameters:
  • path (str) – path to the tokenizer model.

  • config (dict) – tokenizer parameters:
      • library (str): tokenizer library.
      • class_name (str): name of the tokenizer class.
      • class_path (str): path to the tokenizer class.
      • model_type (str): type of the model to be used with the tokenizer.

_restore_model(**kwargs)#

Returns tokenizer library object.

tokenize(text: Union[str, List[Dict]]) → List[int]#

Text tokenization.

Parameters:

text (str | list) – text to be tokenized.

Returns:

list of ids.

Return type:

list

detokenize(ids: List[int]) → str#

Text detokenization.

Parameters:

ids (list) – token ids to be detokenized.

Returns:

detokenized text.

Return type:

str
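
The tokenize()/detokenize() pair forms a round trip between text and token ids. A minimal sketch of that contract, using a hypothetical whitespace tokenizer (the real class delegates to the underlying tokenizer library object):

```python
# Illustrative only: a toy vocabulary-based tokenizer mirroring the
# tokenize()/detokenize() contract (str -> List[int] -> str).
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                                # token -> id
        self.inv_vocab = {i: t for t, i in vocab.items()} # id -> token

    def tokenize(self, text):
        # Whitespace split; unknown words raise KeyError in this sketch.
        return [self.vocab[tok] for tok in text.split()]

    def detokenize(self, ids):
        return " ".join(self.inv_vocab[i] for i in ids)

tok = ToyTokenizer({"hello": 0, "world": 1})
ids = tok.tokenize("hello world")   # [0, 1]
text = tok.detokenize(ids)          # "hello world"
```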

tokenize_conversation(
conversation: List[Dict],
return_target: bool,
add_generation_prompt: bool,
)#

Convert a conversation to tokens.

Parameters:
  • conversation (List[Dict]) – sequence of system/user/assistant messages. Must be in the following format: [{"role": "user", "content": "something"}, {"role": "assistant", "content": "something2"}]

  • return_target (bool) – Return target tokens with system and assistant masked.

  • add_generation_prompt (bool) – Add assistant prefix to the end.
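
The expected message structure can be checked before calling tokenize_conversation(). This sketch only validates the shape described above; the helper name is illustrative, and the actual token output depends on the tokenizer's chat template:

```python
# Conversation format expected by tokenize_conversation(), per the docstring.
conversation = [
    {"role": "user", "content": "something"},
    {"role": "assistant", "content": "something2"},
]

def validate_conversation(conv):
    """Hypothetical helper: check each message is a dict with a known role
    and a 'content' key."""
    allowed = {"system", "user", "assistant"}
    return all(
        isinstance(m, dict) and m.get("role") in allowed and "content" in m
        for m in conv
    )

validate_conversation(conversation)  # True
```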

add_special_tokens(special_tokens: Union[list, dict]) None#

Adds a dictionary of special tokens (eos, pad, cls…). Tokens are only added if they are not already in the vocabulary. Indexed starting from the last index of the current vocabulary.

Parameters:

special_tokens (list | dict) – list or dict of special token strings. Dict keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
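
The indexing behavior described above can be sketched as follows. This is an illustrative stand-in, not the real implementation: tokens already in the vocabulary are skipped, and new ones receive ids starting from the current vocabulary size:

```python
# Sketch of the documented add_special_tokens() behavior: skip tokens that
# already exist; append new ones at the next free index.
def add_special_tokens(vocab, special_tokens):
    for token in special_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next index after current vocab
    return vocab

vocab = {"hello": 0, "world": 1}
add_special_tokens(vocab, ["<pad>", "hello", "<eos>"])
# vocab -> {"hello": 0, "world": 1, "<pad>": 2, "<eos>": 3}
```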

convert_tokens_to_ids(tokens: List[str])#

Convert tokens to IDs.

abstractmethod apply_chat_template()#

Applies tokenizer’s chat template.

get_special_tokens() → list#

Returns a list of the additional special tokens.

offsets(ids: list[int], text: str) → list[int]#

Calculate offsets.
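
One common meaning of token offsets is the starting character position of each token in the original text. A sketch under that assumption, operating on token strings rather than ids for clarity (the real method's behavior depends on the underlying tokenizer library):

```python
# Illustrative offsets computation: for each token, the character index
# where it starts in the source text, scanning left to right.
def token_offsets(tokens, text):
    positions, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)  # find next occurrence
        positions.append(start)
        cursor = start + len(tok)
    return positions

token_offsets(["hello", "world"], "hello world")  # [0, 6]
```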

property vocab#

Tokenizer vocab.

property vocab_size: int#

Returns vocabulary size.

property pad#

Pad token ID.

property eod#

End of sentence token ID.