`bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader`#

Flickr8k (intro/flickr8k) sample loader — ported verbatim from playground/data/sft/step37/flickr8k_sft_data.py.

Downloads train/metadata.csv and the per-row train/<file_name>.jpg images via huggingface_hub.hf_hub_download (no transformers involved). Output: a list of `Flickr8kSample` records, which are then wrapped in a `Step37Flickr8kDataset` for the tokenize step.

Module Contents#

Classes#

`Flickr8kSample`	Image-caption sample from the Flickr8k dataset.
`Step37Flickr8kDataset`	Step3.7 SFT dataset over the CC0 Flickr8k image-caption data.

Functions#

`get_flickr8k_dataset_file`	Download (or reuse cached) a single Flickr8k file via `hf_hub_download`.
`_is_global_rank_0`	Return True on global rank 0 (or when torch.distributed isn’t initialized).
`_maybe_barrier`	`torch.distributed.barrier()` if process group is up, else no-op.
`prepare_flickr8k_samples`	Download metadata.csv + the first `sample_count` images and build `Flickr8kSample` records.

API#

class bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample#

Image-caption sample from the Flickr8k dataset.

image_path: str#: None

caption: str#: None

class bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Step37Flickr8kDataset( samples: list[bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample], template, prompt: str, )#

Step3.7 SFT dataset over the CC0 Flickr8k image-caption data.

A map-style dataset that is compatible with torch.utils.data.Dataset by duck typing (it does not inherit from it): __len__ returns len(samples), and __getitem__(idx) returns the tokenized `MultimodalSFTSample`.
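The duck-typed map-style protocol described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `CaptionSample`, `TinyCaptionDataset`, the dialog layout, and the `tokenize` callable are hypothetical stand-ins for `Flickr8kSample`, `Step37Flickr8kDataset`, `_to_dialog`, and the template-driven tokenize step.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CaptionSample:
    """Hypothetical stand-in for Flickr8kSample."""

    image_path: str
    caption: str


class TinyCaptionDataset:
    """Map-style dataset by duck typing: no torch.utils.data.Dataset base class."""

    def __init__(
        self,
        samples: list[CaptionSample],
        tokenize: Callable[[dict[str, Any]], Any],
        prompt: str,
    ) -> None:
        self.samples = samples
        self.tokenize = tokenize  # placeholder for the real tokenize step
        self.prompt = prompt

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Any:
        # Build a dialog from the raw sample, then hand it to the tokenizer.
        return self.tokenize(self._to_dialog(self.samples[idx], self.prompt))

    @staticmethod
    def _to_dialog(sample: CaptionSample, prompt: str) -> dict[str, Any]:
        # Assumed layout; the real dialog schema is not shown on this page.
        return {
            "image": sample.image_path,
            "user": prompt,
            "assistant": sample.caption,
        }


dataset = TinyCaptionDataset(
    samples=[
        CaptionSample("a.jpg", "a dog runs"),
        CaptionSample("b.jpg", "a cat sleeps"),
    ],
    tokenize=lambda dialog: dialog,  # identity function instead of a tokenizer
    prompt="Describe the image.",
)
```

Because a map-style dataset only needs `__len__` and `__getitem__`, an object like this can be passed to a `torch.utils.data.DataLoader` without subclassing anything.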


__len__() → int#

__getitem__(idx: int)#

static _to_dialog( sample: bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample, prompt: str, ) → dict[str, Any]#

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.get_flickr8k_dataset_file( *, repo_id: str, filename: str, cache_dir: pathlib.Path, ) → pathlib.Path#

Download (or reuse cached) a single Flickr8k file via hf_hub_download.

The download call is not multi-process safe: Hugging Face’s Xet client and _local_folder.read/write_download_metadata use per-file metadata locks that deadlock when N ranks race against the same cache_dir. Callers must serialise concurrent invocations. The standard pattern is a rank-0-only download followed by torch.distributed.barrier(), as in `prepare_flickr8k_samples`.

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader._is_global_rank_0() → bool#

Return True on global rank 0 (or when torch.distributed isn’t initialized).

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader._maybe_barrier() → None#

Calls torch.distributed.barrier() if the process group is up; otherwise a no-op.

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.prepare_flickr8k_samples( *, repo_id: str = 'intro/flickr8k', split: str = 'train', sample_count: int | None = 8, caption_key: str = 'caption_0', cache_dir: str = '.cache/step37_flickr8k', ) → list[bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample]#

Download metadata.csv + the first sample_count images and build Flickr8kSample records.

sample_count defaults to 8. Pass None to take the full Flickr8k train split (~6000 rows, ~1 GB jpgs, slow on a cold cache).
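The `sample_count` semantics (an integer takes the first N rows, `None` takes every row) can be illustrated with a short sketch. The in-memory CSV, the `file_name` column name, and the `first_rows` helper are assumptions made for the example; only the `caption_0` default comes from the signature above.

```python
import csv
import io
from itertools import islice

# Hypothetical metadata.csv content; the real file's columns may differ.
METADATA_CSV = """file_name,caption_0
img_001.jpg,A dog runs across a field
img_002.jpg,Two children play on a beach
img_003.jpg,A man climbs a rock face
"""


def first_rows(csv_text, sample_count, caption_key="caption_0"):
    """Return (file_name, caption) pairs for the first sample_count rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # islice with a stop of None consumes the whole iterator, which
    # matches the "None means full split" behaviour.
    rows = islice(reader, sample_count)
    return [(row["file_name"], row[caption_key]) for row in rows]


subset = first_rows(METADATA_CSV, sample_count=2)
everything = first_rows(METADATA_CSV, sample_count=None)
```

Limiting the rows before downloading any image is what keeps a small `sample_count` cheap on a cold cache.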

Distributed-safety: the actual hf_hub_download calls only run on global rank 0; non-zero ranks wait on a torch.distributed.barrier until rank 0 has populated the cache, then they read the same files from disk. This avoids the multi-process deadlock seen when N ranks race huggingface_hub’s Xet + _local_folder.metadata locks against the same cache_dir (lustre / NFS shared filesystem).
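The rank-0-then-barrier pattern described above might look roughly like the sketch below. It is not the module's code: `fetch_then_read` and the `fetch` callable are hypothetical, and a local byte payload stands in for the real `hf_hub_download` call. The optional import is only there so the sketch also runs where torch is not installed.

```python
import tempfile
from pathlib import Path

try:
    import torch.distributed as dist
except ImportError:  # torch is optional for this sketch
    dist = None


def _dist_ready() -> bool:
    """True only when a torch.distributed process group is initialized."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def is_global_rank_0() -> bool:
    # Single-process runs count as rank 0.
    return not _dist_ready() or dist.get_rank() == 0


def maybe_barrier() -> None:
    if _dist_ready():
        dist.barrier()


def fetch_then_read(cache_dir: Path, name: str, fetch) -> bytes:
    """Populate the cache on rank 0 only, then let every rank read it."""
    target = cache_dir / name
    if is_global_rank_0():
        cache_dir.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            # fetch() is a hypothetical stand-in for the real download.
            target.write_bytes(fetch(name))
    # Non-zero ranks wait here until rank 0 has written the file.
    maybe_barrier()
    return target.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    payload = fetch_then_read(
        Path(tmp) / "cache",
        "metadata.csv",
        fetch=lambda name: b"file_name,caption_0\n",
    )
```

The pattern assumes that all ranks see the same `cache_dir`, as on a shared filesystem. With node-local disks, a per-node leader rank would have to perform the download instead.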
