bridge.training.tokenizers.multimodal_tokenizer#

Multimodal tokenizer.

Module Contents#

Classes#

PromptConfig

Config options for different prompt formats.

MultimodalTokenizer

Multimodal Tokenizer.

Data#

API#

bridge.training.tokenizers.multimodal_tokenizer.IMAGE_TAGS#

None

bridge.training.tokenizers.multimodal_tokenizer.mistral_custom_template = <Multiline-String>#

bridge.training.tokenizers.multimodal_tokenizer.nvlm_yi_34b_template#

"{{- bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['co…"

bridge.training.tokenizers.multimodal_tokenizer.qwen2p0_custom_template = <Multiline-String>#

bridge.training.tokenizers.multimodal_tokenizer.llama3p1_chat_template = <Multiline-String>#

class bridge.training.tokenizers.multimodal_tokenizer.PromptConfig#

Config options for different prompt formats.

assistant_prefix_len: int#

None

pad_token_id: int#

None

custom_chat_template: str#

None

has_bos: bool#

None

has_system_role: bool#

None
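
The fields listed above can be pictured as a plain dataclass. The following is a minimal sketch, not the module's actual definition: the class name `PromptConfigSketch`, the field comments (inferred from the field names), and every value are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PromptConfigSketch:
    """Hypothetical stand-in mirroring the documented PromptConfig fields."""

    # Comments below are interpretations inferred from the field names.
    assistant_prefix_len: int  # length, in tokens, of the assistant prefix
    pad_token_id: int  # token ID used for padding
    custom_chat_template: str  # chat template string for this format
    has_bos: bool  # whether the format begins with a BOS token
    has_system_role: bool  # whether the format accepts a system message


# Illustrative values only; real values depend on the chosen prompt format.
example_config = PromptConfigSketch(
    assistant_prefix_len=4,
    pad_token_id=0,
    custom_chat_template="{{ messages }}",  # placeholder, not a real template
    has_bos=True,
    has_system_role=True,
)
```

Keeping per-format options in one object like this lets the tokenizer select a single config from `prompt_format` instead of branching on the format name throughout its methods.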

class bridge.training.tokenizers.multimodal_tokenizer.MultimodalTokenizer(
tokenizer: megatron.core.datasets.megatron_tokenizer.MegatronTokenizer,
prompt_format: str,
special_tokens: List[str],
image_tag_type: str,
)#

Bases: megatron.core.datasets.megatron_tokenizer.MegatronTokenizer

Multimodal Tokenizer.

Initialization

Tokenizer with support for non-text inputs.

Note: Currently, only HuggingFaceTokenizer is supported as the underlying text tokenizer.

Parameters:
  • tokenizer (MegatronTokenizer) – Underlying tokenizer.

  • prompt_format (str) – Prompt format for the tokenizer.

  • special_tokens (List[str]) – Non-text tokens.

  • image_tag_type (str) – Type of image tag to apply, if any.

_apply_image_tag(text: Union[str, List[Dict]])#

Surround the image placeholder in the input with opening and closing image tags.
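
The tagging step can be sketched as a string replacement that handles both accepted input types. This is an illustration under stated assumptions, not the method's actual implementation: the placeholder `<image>`, the tag pair `<img>`/`</img>`, and the function name `apply_image_tag_sketch` are all hypothetical, since the real values come from `IMAGE_TAGS` and are not shown in this reference.

```python
from typing import Dict, List, Tuple, Union

# Assumed placeholder and tag pair, chosen for illustration only.
IMAGE_PLACEHOLDER = "<image>"
EXAMPLE_TAGS: Tuple[str, str] = ("<img>", "</img>")


def apply_image_tag_sketch(
    text: Union[str, List[Dict]],
    tags: Tuple[str, str] = EXAMPLE_TAGS,
) -> Union[str, List[Dict]]:
    """Wrap every image placeholder with an opening and a closing tag."""
    opening, closing = tags
    replacement = opening + IMAGE_PLACEHOLDER + closing

    # Plain string input: replace the placeholder directly.
    if isinstance(text, str):
        return text.replace(IMAGE_PLACEHOLDER, replacement)

    # Conversation input: rewrite each message's content on a copy,
    # leaving the caller's list and dicts untouched.
    tagged = []
    for message in text:
        updated = dict(message)
        updated["content"] = message["content"].replace(
            IMAGE_PLACEHOLDER, replacement
        )
        tagged.append(updated)
    return tagged
```

For example, `apply_image_tag_sketch("<image>\nDescribe the picture.")` yields the same string with the placeholder wrapped in the assumed tag pair.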

tokenize(text: Union[str, List[Dict]])#

Tokenize conversation or string input.
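
Because `tokenize` accepts either a string or a list of messages, one plausible shape for it is a dispatch on the input type. The sketch below assumes that shape and uses toy encoders (each word maps to its length) in place of a real tokenizer; all function names here are hypothetical.

```python
from typing import Dict, List, Union


def toy_encode(text: str) -> List[int]:
    """Toy text encoder: one ID per whitespace-separated word (its length)."""
    return [len(word) for word in text.split()]


def toy_tokenize_conversation(conversation: List[Dict]) -> List[int]:
    """Toy conversation encoder: encode 'role: content' for each message."""
    tokens: List[int] = []
    for message in conversation:
        tokens.extend(toy_encode(f"{message['role']}: {message['content']}"))
    return tokens


def tokenize_sketch(text: Union[str, List[Dict]]) -> List[int]:
    """Route conversations and plain strings to the matching encoder."""
    if isinstance(text, list):
        return toy_tokenize_conversation(text)
    return toy_encode(text)
```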

_encode(text: str)#

Tokenize text input.

tokenize_conversation(
conversation: List[Dict],
return_target: bool,
add_generation_prompt: bool,
)#

Convert a conversation to tokens.

Parameters:
  • conversation (List[Dict]) – Sequence of system/user/assistant messages. Must be in the following format: [{"role": "user", "content": "something"}, {"role": "assistant", "content": "something2"}]

  • return_target (bool) – Also return target tokens, with system and user turns masked.

  • add_generation_prompt (bool) – Append the assistant prefix to the end of the token sequence.
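
The interaction between the two flags can be sketched with a toy encoder. This is not the method's implementation. It assumes the common supervised fine-tuning convention in which only the content of assistant replies is kept in the target while system turns, user turns, and role prefixes are masked. The masking value `-100`, the `<role>` prefix format, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

IGNORE_INDEX = -100  # assumed masking value, a common loss-function convention


def toy_encode(text: str) -> List[int]:
    """Toy encoder: one ID per whitespace-separated word (its length)."""
    return [len(word) for word in text.split()]


def tokenize_conversation_sketch(
    conversation: List[Dict],
    return_target: bool,
    add_generation_prompt: bool,
) -> Union[List[int], Tuple[List[int], List[int]]]:
    """Encode a conversation and optionally build a masked target."""
    tokens: List[int] = []
    target: List[int] = []

    for message in conversation:
        prefix = toy_encode(f"<{message['role']}>")  # hypothetical role prefix
        body = toy_encode(message["content"])
        tokens.extend(prefix + body)
        if message["role"] == "assistant":
            # Keep the reply content, mask only the role prefix.
            target.extend([IGNORE_INDEX] * len(prefix) + body)
        else:
            # Mask the whole turn.
            target.extend([IGNORE_INDEX] * (len(prefix) + len(body)))

    if add_generation_prompt:
        # Append an assistant prefix so a model would continue with a reply.
        prefix = toy_encode("<assistant>")
        tokens.extend(prefix)
        target.extend([IGNORE_INDEX] * len(prefix))

    if return_target:
        return tokens, target
    return tokens
```

In this sketch `return_target=True` is the training-style call, since it yields the masked target alongside the tokens, while `add_generation_prompt=True` is the inference-style call that leaves the sequence open for a new assistant reply.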

convert_tokens_to_ids(tokens: List[str])#

Convert tokens to IDs.

detokenize(tokens: List[int])#

Detokenize tokens.

get_special_tokens()#

Get special tokens.

property pad#

Pad token ID.

property eod#

End of sentence token ID.

property vocab#

Vocab.

property inv_vocab#

Inverse vocab.

property vocab_size#

Vocabulary size.
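
The relationship between `vocab`, `inv_vocab`, `vocab_size`, `convert_tokens_to_ids`, and `detokenize` can be illustrated with a toy class. This is a sketch under simple assumptions (unique tokens, IDs assigned by position, tokens joined with spaces when decoding); the class `ToyVocab` is hypothetical, and a real subword tokenizer decodes differently.

```python
from typing import Dict, List


class ToyVocab:
    """Toy vocabulary showing how the lookup properties relate."""

    def __init__(self, tokens: List[str]) -> None:
        # vocab maps token -> ID; inv_vocab is the reverse mapping.
        self.vocab: Dict[str, int] = {tok: i for i, tok in enumerate(tokens)}
        self.inv_vocab: Dict[int, str] = {i: tok for tok, i in self.vocab.items()}

    @property
    def vocab_size(self) -> int:
        """Number of entries in the vocabulary."""
        return len(self.vocab)

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        """Look up each token string in vocab."""
        return [self.vocab[token] for token in tokens]

    def detokenize(self, ids: List[int]) -> str:
        """Map IDs back to token strings and join them with spaces."""
        return " ".join(self.inv_vocab[i] for i in ids)


toy = ToyVocab(["<pad>", "hello", "world"])
ids = toy.convert_tokens_to_ids(["hello", "world"])
text = toy.detokenize(ids)
```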