bridge.data.vlm_datasets.step37_flickr8k.template#

Step3.7 multimodal SFT template.

Loads the tokenizer with transformers.AutoTokenizer.from_pretrained and trust_remote_code=False (so no custom HF Python code is executed). This is the only transformers-library use in this package; everything else is pure torch / huggingface_hub.

The tokenize path (apply_chat_template) uses the local chat_template.jinja shipped with step3p7_flash_bf16, which determines the token sequence produced for a given input dialog.

Module Contents#

Classes#

MultimodalSFTSample

Tokenized SFT sample whose length is the shifted LM training length.

Step37MultimodalTemplate

Step3.7 SFT tokenize template.

Functions#

_identity_path

_expand_step37_image_placeholders

Expand a single <image> placeholder into <im_start> + <im_patch> Γ— 169 + <im_end> (or the multicrop variant for patches).

_load_hf_tokenizer

Load a tokenizer from a local HF snapshot. trust_remote_code=False is hard-coded β€” we never execute custom HF Python code.

Data#

API#

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_PLACEHOLDER#

β€˜β€™

bridge.data.vlm_datasets.step37_flickr8k.template.MULTICROP_IMAGE_PLACEHOLDER#

β€˜<@image@>’

bridge.data.vlm_datasets.step37_flickr8k.template.MULTICROP_PATCH_PLACEHOLDER#

β€˜<#image#>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_TOKEN#

β€˜<im_patch>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_START_TOKEN#

β€˜<im_start>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_END_TOKEN#

β€˜<im_end>’

bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_START_TOKEN#

β€˜<patch_start>’

bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_END_TOKEN#

β€˜<patch_end>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_TOKEN_COUNT#

169

bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_TOKEN_COUNT#

81

bridge.data.vlm_datasets.step37_flickr8k.template.logger#

β€˜getLogger(…)’

bridge.data.vlm_datasets.step37_flickr8k.template._identity_path(path: str) str#
bridge.data.vlm_datasets.step37_flickr8k.template._expand_step37_image_placeholders(
text: str,
*,
image_token_count: int = IMAGE_TOKEN_COUNT,
patch_token_count: int = PATCH_TOKEN_COUNT,
image_token: str = IMAGE_TOKEN,
image_start_token: str = IMAGE_START_TOKEN,
image_end_token: str = IMAGE_END_TOKEN,
) str#

Expand a single <image> placeholder into <im_start> + <im_patch> Γ— 169 + <im_end> (or the multicrop variant for patches).

class bridge.data.vlm_datasets.step37_flickr8k.template.MultimodalSFTSample#

Bases: dict

Tokenized SFT sample whose length is the shifted LM training length.

len(sample) = tokens.numel() - 1 because the pack step uses tokens[:-1] / tokens[1:] shift-by-one.

Initialization

Initialize self. See help(type(self)) for accurate signature.

__len__() int#
bridge.data.vlm_datasets.step37_flickr8k.template._load_hf_tokenizer(tokenizer_path: str)#

Load a tokenizer from a local HF snapshot. trust_remote_code=False is hard-coded β€” we never execute custom HF Python code.

class bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate(
*,
tokenizer_path: str,
image_token_count: int,
patch_token_count: int,
image_token: str,
image_start_token: str,
image_end_token: str,
patch_start_token: str,
patch_end_token: str,
max_sequence_length: int,
path_rewrite_fn: Optional[collections.abc.Callable[[str], str]] = None,
)#

Step3.7 SFT tokenize template.

Expands <image> placeholders β†’ <im_start><im_patch>Γ—169<im_end> inside every user / tool turn, then runs tokenizer.apply_chat_template(messages, tokenize=True) to produce tokens (LongTensor). The loss_mask is set to 1 only on the assistant turn span(s), found by re-tokenizing the prefix up to and including each assistant turn.

Initialization

image_placeholder#

None

multicrop_image_placeholder#

None

multicrop_patch_placeholder#

None

_expand_image_placeholders(text: str) str#
_normalize_messages(
data: list[dict[str, Any]],
) list[dict[str, Any]]#
_apply_chat_template(
messages: list[dict[str, Any]],
) list[int]#
_normalize_images(
raw_images: Optional[list[Any]],
) list[tuple[str, int]]#
__call__(
data: dict,
) bridge.data.vlm_datasets.step37_flickr8k.template.MultimodalSFTSample#