bridge.data.vlm_datasets.step37_flickr8k.template#
Step3.7 multimodal SFT template.
Loads the tokenizer with transformers.AutoTokenizer.from_pretrained
and trust_remote_code=False (so no custom HF Python code is executed).
This is the only transformers-library use in this package; everything
else is pure torch / huggingface_hub.
The tokenize path (apply_chat_template) uses the local
chat_template.jinja shipped with step3p7_flash_bf16, which
determines the token sequence produced for a given input dialog.
Module Contents#
Classes#
Tokenized SFT sample whose length is the shifted LM training length. |
|
Step3.7 SFT tokenize template. |
Functions#
Expand a single |
|
Load a tokenizer from a local HF snapshot. |
Data#
API#
- bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_PLACEHOLDER#
β
β
- bridge.data.vlm_datasets.step37_flickr8k.template.MULTICROP_IMAGE_PLACEHOLDER#
β<@image@>β
- bridge.data.vlm_datasets.step37_flickr8k.template.MULTICROP_PATCH_PLACEHOLDER#
β<#image#>β
- bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_TOKEN#
β<im_patch>β
- bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_START_TOKEN#
β<im_start>β
- bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_END_TOKEN#
β<im_end>β
- bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_START_TOKEN#
β<patch_start>β
- bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_END_TOKEN#
β<patch_end>β
- bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_TOKEN_COUNT#
169
- bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_TOKEN_COUNT#
81
- bridge.data.vlm_datasets.step37_flickr8k.template.logger#
βgetLogger(β¦)β
- bridge.data.vlm_datasets.step37_flickr8k.template._identity_path(path: str) str#
- bridge.data.vlm_datasets.step37_flickr8k.template._expand_step37_image_placeholders(
- text: str,
- *,
- image_token_count: int = IMAGE_TOKEN_COUNT,
- patch_token_count: int = PATCH_TOKEN_COUNT,
- image_token: str = IMAGE_TOKEN,
- image_start_token: str = IMAGE_START_TOKEN,
- image_end_token: str = IMAGE_END_TOKEN,
Expand a single
<image>placeholder into<im_start>+<im_patch>Γ 169 +<im_end>(or the multicrop variant for patches).
- class bridge.data.vlm_datasets.step37_flickr8k.template.MultimodalSFTSample#
Bases:
dictTokenized SFT sample whose length is the shifted LM training length.
len(sample) = tokens.numel() - 1because the pack step usestokens[:-1]/tokens[1:]shift-by-one.Initialization
Initialize self. See help(type(self)) for accurate signature.
- __len__() int#
- bridge.data.vlm_datasets.step37_flickr8k.template._load_hf_tokenizer(tokenizer_path: str)#
Load a tokenizer from a local HF snapshot.
trust_remote_code=Falseis hard-coded β we never execute custom HF Python code.
- class bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate(
- *,
- tokenizer_path: str,
- image_token_count: int,
- patch_token_count: int,
- image_token: str,
- image_start_token: str,
- image_end_token: str,
- patch_start_token: str,
- patch_end_token: str,
- max_sequence_length: int,
- path_rewrite_fn: Optional[collections.abc.Callable[[str], str]] = None,
Step3.7 SFT tokenize template.
Expands
<image>placeholders β<im_start><im_patch>Γ169<im_end>inside every user / tool turn, then runstokenizer.apply_chat_template(messages, tokenize=True)to producetokens(LongTensor). Theloss_maskis set to 1 only on the assistant turn span(s), found by re-tokenizing the prefix up to and including each assistant turn.Initialization
- image_placeholder#
None
- multicrop_image_placeholder#
None
- multicrop_patch_placeholder#
None
- _expand_image_placeholders(text: str) str#
- _normalize_messages(
- data: list[dict[str, Any]],
- _apply_chat_template(
- messages: list[dict[str, Any]],
- _normalize_images(
- raw_images: Optional[list[Any]],
- __call__(
- data: dict,