`bridge.data.vlm_datasets.step37_flickr8k.template`#

Step3.7 multimodal SFT template.

Loads the tokenizer with transformers.AutoTokenizer.from_pretrained and trust_remote_code=False (so no custom HF Python code is executed). This is the only transformers-library use in this package; everything else is pure torch / huggingface_hub.

The tokenize path (apply_chat_template) uses the local chat_template.jinja shipped with step3p7_flash_bf16, which determines the token sequence produced for a given input dialog.

Module Contents#

Classes#

`MultimodalSFTSample`	Tokenized SFT sample whose length is the shifted LM training length.
`Step37MultimodalTemplate`	Step3.7 SFT tokenize template.

Functions#

`_identity_path`
`_expand_step37_image_placeholders`	Expand a single `<image>` placeholder into `<im_start>` + `<im_patch>` × 169 + `<im_end>` (or the multicrop variant for patches).
`_load_hf_tokenizer`	Load a tokenizer from a local HF snapshot. `trust_remote_code=False` is hard-coded — we never execute custom HF Python code.

Data#

`IMAGE_PLACEHOLDER`
`MULTICROP_IMAGE_PLACEHOLDER`
`MULTICROP_PATCH_PLACEHOLDER`
`IMAGE_TOKEN`
`IMAGE_START_TOKEN`
`IMAGE_END_TOKEN`
`PATCH_START_TOKEN`
`PATCH_END_TOKEN`
`IMAGE_TOKEN_COUNT`
`PATCH_TOKEN_COUNT`
`logger`

API#

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_PLACEHOLDER#: ‘’

bridge.data.vlm_datasets.step37_flickr8k.template.MULTICROP_IMAGE_PLACEHOLDER#: ‘<@image@>’

bridge.data.vlm_datasets.step37_flickr8k.template.MULTICROP_PATCH_PLACEHOLDER#: ‘<#image#>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_TOKEN#: ‘<im_patch>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_START_TOKEN#: ‘<im_start>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_END_TOKEN#: ‘<im_end>’

bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_START_TOKEN#: ‘<patch_start>’

bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_END_TOKEN#: ‘<patch_end>’

bridge.data.vlm_datasets.step37_flickr8k.template.IMAGE_TOKEN_COUNT#: 169

bridge.data.vlm_datasets.step37_flickr8k.template.PATCH_TOKEN_COUNT#: 81

bridge.data.vlm_datasets.step37_flickr8k.template.logger#: ‘getLogger(…)’

bridge.data.vlm_datasets.step37_flickr8k.template._identity_path(path: str) → str#

bridge.data.vlm_datasets.step37_flickr8k.template._expand_step37_image_placeholders( text: str, *, image_token_count: int = IMAGE_TOKEN_COUNT, patch_token_count: int = PATCH_TOKEN_COUNT, image_token: str = IMAGE_TOKEN, image_start_token: str = IMAGE_START_TOKEN, image_end_token: str = IMAGE_END_TOKEN, ) → str#: Expand a single <image> placeholder into <im_start> + <im_patch> × 169 + <im_end> (or the multicrop variant for patches).

class bridge.data.vlm_datasets.step37_flickr8k.template.MultimodalSFTSample#

Bases: dict

Tokenized SFT sample whose length is the shifted LM training length.

len(sample) = tokens.numel() - 1 because the pack step uses tokens[:-1] / tokens[1:] shift-by-one.

Initialization

Initialize self. See help(type(self)) for accurate signature.

__len__() → int#

bridge.data.vlm_datasets.step37_flickr8k.template._load_hf_tokenizer(tokenizer_path: str)#: Load a tokenizer from a local HF snapshot. trust_remote_code=False is hard-coded — we never execute custom HF Python code.

class bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate( *, tokenizer_path: str, image_token_count: int, patch_token_count: int, image_token: str, image_start_token: str, image_end_token: str, patch_start_token: str, patch_end_token: str, max_sequence_length: int, path_rewrite_fn: Optional[collections.abc.Callable[[str], str]] = None, )#

Step3.7 SFT tokenize template.

Expands <image> placeholders → <im_start><im_patch>×169<im_end> inside every user / tool turn, then runs tokenizer.apply_chat_template(messages, tokenize=True) to produce tokens (LongTensor). The loss_mask is set to 1 only on the assistant turn span(s), found by re-tokenizing the prefix up to and including each assistant turn.