core.tokenizers.vision.libraries.multimodal_tokenizer#
Module Contents#
Classes#
Multimodal Tokenizer. |
Data#
API#
- core.tokenizers.vision.libraries.multimodal_tokenizer.IMAGE_TAGS#
None
- core.tokenizers.vision.libraries.multimodal_tokenizer.mistral_custom_template = <Multiline-String>#
- core.tokenizers.vision.libraries.multimodal_tokenizer.nvlm_yi_34b_template = <Multiline-String>#
- core.tokenizers.vision.libraries.multimodal_tokenizer.qwen2p0_custom_template = <Multiline-String>#
- core.tokenizers.vision.libraries.multimodal_tokenizer.llama3p1_chat_template = <Multiline-String>#
- core.tokenizers.vision.libraries.multimodal_tokenizer.nemotron_custom_template = <Multiline-String>#
- core.tokenizers.vision.libraries.multimodal_tokenizer.nemotron_aligned_custom_template = <Multiline-String>#
- class core.tokenizers.vision.libraries.multimodal_tokenizer.MegatronMultimodalTokenizer(
- path: str,
- prompt_format: str,
- special_tokens: List[str],
- image_tag_type: str,
- force_system_message: bool = False,
- **kwargs,
Multimodal Tokenizer.
Initialization
Tokenizer with a support for non-text inputs.
Note: Currently, only HuggingFaceTokenizer is supported as the underlying text tokenizer.
- Parameters:
path (str) – Path to the underlying tokenizer.
prompt_format (str) – Prompt format for the tokenizer.
special_tokens (List[str]) – Non-text tokens.
image_tag_type (str) – Image tag to apply, if any. For example
.
- _apply_image_tag(text: Union[str, List[Dict]])#
Surround
with image tags such as and .
- tokenize(text: Union[str, List[Dict]])#
Tokenize conversation or string input.
- _encode(text: str)#
Tokenize text input.
- tokenize_conversation(
- conversation: List[Dict],
- return_target: bool,
- add_generation_prompt: bool,
Convert a conversation to tokens.
- Parameters:
conversation (List[Dict]) – Sequence of system/user/assistant messages. Must be in the following format: [ {“role”: “user”, “content”: “something”}, {“role”: “assistant”, “content”: “something2”}, ]
return_target (bool) – Return target tokens with system and assistant masked.
add_generation_prompt (bool) – Add assistant prefix to the end.
- convert_tokens_to_ids(tokens: List[str])#
Convert tokens to IDs.
- detokenize(tokens: List[int])#
Detokenize tokens.
- add_special_tokens(special_tokens: List[str])#
Add special tokens.
- get_special_tokens()#
Get special tokens.
- property pad#
Pad token ID.
- property eod#
End of sentence token ID.
- property vocab_size#
Vocabulary size.
- property vocab#
Tokenizer vocab.