bridge.training.tokenizers.multimodal_tokenizer#
Multimodal tokenizer.
Module Contents#
Classes#
| PromptConfig | Config options for different prompt formats. |
| MultimodalTokenizer | Multimodal Tokenizer. |
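Data#
| IMAGE_TAGS |
| mistral_custom_template |
| nvlm_yi_34b_template |
| qwen2p0_custom_template |
| llama3p1_chat_template |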
API#
- bridge.training.tokenizers.multimodal_tokenizer.IMAGE_TAGS#
None
- bridge.training.tokenizers.multimodal_tokenizer.mistral_custom_template = <Multiline-String>#
- bridge.training.tokenizers.multimodal_tokenizer.nvlm_yi_34b_template#
"{{- bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['co…"
- bridge.training.tokenizers.multimodal_tokenizer.qwen2p0_custom_template = <Multiline-String>#
- bridge.training.tokenizers.multimodal_tokenizer.llama3p1_chat_template = <Multiline-String>#
- class bridge.training.tokenizers.multimodal_tokenizer.PromptConfig#
Config options for different prompt formats.
- assistant_prefix_len: int#
None
- pad_token_id: int#
None
- custom_chat_template: str#
None
- has_bos: bool#
None
- has_system_role: bool#
None
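The fields above can be pictured as a plain dataclass. The following is a minimal sketch, not the library's definition: the class name `PromptConfigSketch`, the `Optional` annotation, the field comments, and all values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptConfigSketch:
    """Illustrative stand-in that mirrors the documented PromptConfig fields."""

    assistant_prefix_len: int  # assumed meaning: number of tokens in the assistant prefix
    pad_token_id: int  # ID used for padding
    custom_chat_template: Optional[str]  # Optional is an assumption of this sketch
    has_bos: bool  # whether the prompt format starts with a BOS token
    has_system_role: bool  # whether the prompt format accepts a system message


# Hypothetical values, not taken from the library.
example_config = PromptConfigSketch(
    assistant_prefix_len=3,
    pad_token_id=0,
    custom_chat_template=None,
    has_bos=False,
    has_system_role=True,
)
```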
- class bridge.training.tokenizers.multimodal_tokenizer.MultimodalTokenizer(tokenizer: megatron.core.datasets.megatron_tokenizer.MegatronTokenizer, prompt_format: str, special_tokens: List[str], image_tag_type: str)#
Bases: megatron.core.datasets.megatron_tokenizer.MegatronTokenizer
Multimodal Tokenizer.
Initialization
Tokenizer with support for non-text inputs.
Note: Currently, only HuggingFaceTokenizer is supported as the underlying text tokenizer.
- Parameters:
tokenizer (MegatronTokenizer) – Underlying tokenizer.
prompt_format (str) – Prompt format for the tokenizer.
special_tokens (List[str]) – Non-text tokens.
image_tag_type (str) – Image tag type to apply, if any.
- _apply_image_tag(text: Union[str, List[Dict]])#
Surround each image placeholder in the input with opening and closing image tags.
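A minimal sketch of this kind of tag wrapping, handling both a plain string and a list of conversation turns. The placeholder string, the tag pair, and the function name `apply_image_tag_sketch` are assumptions chosen for illustration; the actual values come from `IMAGE_TAGS` and the selected `image_tag_type`.

```python
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical placeholder and tag pair, chosen only for illustration.
IMAGE_PLACEHOLDER = "<image>"
EXAMPLE_TAGS: Optional[Tuple[str, str]] = ("<img>", "</img>")


def apply_image_tag_sketch(
    text: Union[str, List[Dict]],
    tags: Optional[Tuple[str, str]] = EXAMPLE_TAGS,
) -> Union[str, List[Dict]]:
    """Wrap every image placeholder in opening and closing tags."""
    if tags is None:
        # No tag type configured: return the input unchanged.
        return text

    replacement = f"{tags[0]}{IMAGE_PLACEHOLDER}{tags[1]}"

    if isinstance(text, list):
        # Conversation input: rewrite the "content" of each turn (copying the dicts).
        return [
            {**turn, "content": turn["content"].replace(IMAGE_PLACEHOLDER, replacement)}
            for turn in text
        ]

    # Plain string input.
    return text.replace(IMAGE_PLACEHOLDER, replacement)


tagged_text = apply_image_tag_sketch("<image>\nWhat is this?")
tagged_turns = apply_image_tag_sketch([{"role": "user", "content": "<image> Describe."}])
```

The sketch returns new dictionaries rather than mutating the caller's conversation; whether the real method mutates its input is not stated in this reference.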
- tokenize(text: Union[str, List[Dict]])#
Tokenize conversation or string input.
- _encode(text: str)#
Tokenize text input.
- tokenize_conversation(conversation: List[Dict], return_target: bool, add_generation_prompt: bool)#
Convert a conversation to tokens.
- Parameters:
conversation (List[Dict]) – Sequence of system/user/assistant messages. Must be in the following format: [{"role": "user", "content": "something"}, {"role": "assistant", "content": "something2"}]
return_target (bool) – Return target tokens with system and user turns, and the assistant prefix, masked.
add_generation_prompt (bool) – Add assistant prefix to the end.
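To make the three parameters concrete, here is a toy sketch of conversation tokenization with target masking. It is not the library's implementation: the whitespace tokenizer, the ChatML-style markers, the `IGNORE_INDEX` value of -100, the return shape, and the masking scheme (loss kept only on assistant content) are all assumptions of this example.

```python
from typing import Dict, List, Tuple, Union

IGNORE_INDEX = -100  # assumed ignore value, commonly used with cross-entropy losses


def toy_encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Whitespace tokenizer that grows its vocab on demand (illustration only)."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def tokenize_conversation_sketch(
    conversation: List[Dict],
    return_target: bool,
    add_generation_prompt: bool,
    vocab: Dict[str, int],
) -> Union[List[int], Tuple[List[int], List[int]]]:
    tokens: List[int] = []
    target: List[int] = []
    for turn in conversation:
        prefix = toy_encode(f"<|im_start|> {turn['role']}", vocab)
        body = toy_encode(f"{turn['content']} <|im_end|>", vocab)
        tokens += prefix + body
        if turn["role"] == "assistant":
            # Keep assistant content as the training target; mask only its prefix.
            target += [IGNORE_INDEX] * len(prefix) + body
        else:
            # Mask system and user turns entirely.
            target += [IGNORE_INDEX] * (len(prefix) + len(body))
    if add_generation_prompt:
        # Append an assistant prefix so a model would continue with a reply.
        gen = toy_encode("<|im_start|> assistant", vocab)
        tokens += gen
        target += [IGNORE_INDEX] * len(gen)
    return (tokens, target) if return_target else tokens


conversation = [
    {"role": "user", "content": "hello there"},
    {"role": "assistant", "content": "hi"},
]
tokens, target = tokenize_conversation_sketch(conversation, True, False, {})
prompt_tokens = tokenize_conversation_sketch(conversation, False, True, {})
```

With `return_target=True`, the target has the same length as the token list, and only the assistant's content positions carry real token IDs. With `add_generation_prompt=True`, the token list ends with the assistant prefix.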
- convert_tokens_to_ids(tokens: List[str])#
Convert tokens to IDs.
- detokenize(tokens: List[int])#
Detokenize tokens.
- get_special_tokens()#
Get special tokens.
- property pad#
Pad token ID.
- property eod#
End of sentence token ID.
- property vocab#
Vocab.
- property inv_vocab#
Inverse vocab.
- property vocab_size#
Vocabulary size.
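The relationship between `vocab`, `inv_vocab`, `vocab_size`, and `convert_tokens_to_ids` can be sketched with a toy vocabulary. This assumes the usual convention that `vocab` maps token strings to IDs and `inv_vocab` maps IDs back to token strings; the tokens, IDs, and the helper name `convert_tokens_to_ids_sketch` are made up for illustration.

```python
from typing import Dict, List

# Toy vocabulary; the tokens and IDs are made up for illustration.
vocab: Dict[str, int] = {"<pad>": 0, "hello": 1, "world": 2}

# The inverse vocab swaps keys and values: ID -> token string.
inv_vocab: Dict[int, str] = {idx: tok for tok, idx in vocab.items()}

vocab_size = len(vocab)


def convert_tokens_to_ids_sketch(tokens: List[str]) -> List[int]:
    """Look up each token string in the vocab."""
    return [vocab[tok] for tok in tokens]


ids = convert_tokens_to_ids_sketch(["hello", "world"])
round_trip = [inv_vocab[i] for i in ids]
```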