bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader#

Flickr8k (intro/flickr8k) sample loader — ported verbatim from playground/data/sft/step37/flickr8k_sft_data.py.

Downloads train/metadata.csv and the per-row train/<file_name>.jpg images via huggingface_hub.hf_hub_download (no transformers involved). Output: a list of :class:Flickr8kSample, then wrapped into :class:Step37Flickr8kDataset for the tokenize step.

Module Contents#

Classes#

Flickr8kSample

Image-caption sample from the Flickr8k dataset.

Step37Flickr8kDataset

Step3.7 SFT dataset over the CC0 Flickr8k image-caption data.

Functions#

get_flickr8k_dataset_file

Download (or reuse cached) a single Flickr8k file via hf_hub_download.

_is_global_rank_0

Return True on global rank 0 (or when torch.distributed isn’t initialized).

_maybe_barrier

torch.distributed.barrier() if process group is up, else no-op.

prepare_flickr8k_samples

Download metadata.csv + the first sample_count images and build Flickr8kSample records.

API#

class bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample#

Image-caption sample from the Flickr8k dataset.

image_path: str#

None

caption: str#

None

class bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Step37Flickr8kDataset(
samples: list[bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample],
template,
prompt: str,
)#

Step3.7 SFT dataset over the CC0 Flickr8k image-caption data.

Map-style dataset that is duck-typed against torch.utils.data.Dataset (it does not inherit from it): __len__ returns len(samples), and __getitem__(idx) returns the tokenized :class:MultimodalSFTSample.
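A minimal sketch of the duck-typed, map-style shape described above. The names CaptionSample, DuckTypedCaptionDataset and toy_encode are hypothetical stand-ins; in particular, toy_encode only imitates the template/tokenize step and does not reproduce the real MultimodalSFTSample.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CaptionSample:
    """Hypothetical stand-in with the same two fields as Flickr8kSample."""

    image_path: str
    caption: str


class DuckTypedCaptionDataset:
    """Map-style dataset that does not subclass torch.utils.data.Dataset."""

    def __init__(
        self,
        samples: list[CaptionSample],
        encode: Callable[[CaptionSample, str], Any],
        prompt: str,
    ) -> None:
        self.samples = samples
        self.encode = encode  # assumption: stands in for the template/tokenize step
        self.prompt = prompt

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Any:
        # Encoding happens lazily, one sample per lookup.
        return self.encode(self.samples[idx], self.prompt)


def toy_encode(sample: CaptionSample, prompt: str) -> dict[str, str]:
    """Toy encoder: pairs the image path with prompt + caption text."""
    return {"image": sample.image_path, "text": f"{prompt} {sample.caption}"}


dataset = DuckTypedCaptionDataset(
    [
        CaptionSample("train/a.jpg", "A dog runs on grass."),
        CaptionSample("train/b.jpg", "A cat on a sofa."),
    ],
    toy_encode,
    "Describe the image.",
)
```

Because only __len__ and __getitem__ are required for map-style access, an object like this can be indexed directly or handed to a data loader that expects a map-style dataset.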

Initialization

__len__() → int#
__getitem__(idx: int)#
static _to_dialog(
sample: bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample,
prompt: str,
) → dict[str, Any]#
bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.get_flickr8k_dataset_file(
*,
repo_id: str,
filename: str,
cache_dir: pathlib.Path,
) → pathlib.Path#

Download (or reuse cached) a single Flickr8k file via hf_hub_download.

The download call is not multi-process safe — Hugging Face’s Xet client and _local_folder.read/write_download_metadata use per-file metadata locks that deadlock when N ranks race against the same cache_dir. Callers must serialise concurrent invocations (rank-0-only + torch.distributed.barrier() is the standard pattern — see :func:prepare_flickr8k_samples).
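A rough sketch of what a single-file download of this kind might look like. The name fetch_dataset_file is hypothetical, and passing repo_type="dataset" is an assumption about how the repository is hosted; the module's actual call may differ. The sketch does nothing to serialise concurrent callers, so the rank-0-only rule above still applies.

```python
from pathlib import Path


def fetch_dataset_file(*, repo_id: str, filename: str, cache_dir: Path) -> Path:
    """Download one file from a Hub repo, or reuse it if already cached."""
    # Imported lazily so that importing this module does not require the library.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",  # assumption: the files live in a dataset repo
        cache_dir=str(cache_dir),
    )
    # hf_hub_download returns the local path as a string.
    return Path(local_path)
```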

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader._is_global_rank_0() → bool#

Return True on global rank 0 (or when torch.distributed isn’t initialized).

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader._maybe_barrier() → None#

torch.distributed.barrier() if process group is up, else no-op.
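The two helpers above could be sketched as follows. The names is_global_rank_0, maybe_barrier and _dist_ready are illustrative, and the ImportError fallback is an added assumption so the snippet also runs where torch is not installed; the module's own implementation may be written differently.

```python
try:
    import torch.distributed as dist
except ImportError:  # assumption: without torch, behave as a single process
    dist = None


def _dist_ready() -> bool:
    """True only when torch.distributed is usable and a process group exists."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def is_global_rank_0() -> bool:
    """Rank 0, or any process running without an initialized process group."""
    return not _dist_ready() or dist.get_rank() == 0


def maybe_barrier() -> None:
    """Synchronise all ranks when a process group is up; otherwise do nothing."""
    if _dist_ready():
        dist.barrier()
```

In a plain single-process run, is_global_rank_0() returns True and maybe_barrier() returns immediately, so the same call sites work with and without a distributed launcher.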

bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.prepare_flickr8k_samples(
*,
repo_id: str = 'intro/flickr8k',
split: str = 'train',
sample_count: int | None = 8,
caption_key: str = 'caption_0',
cache_dir: str = '.cache/step37_flickr8k',
) → list[bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample]#

Download metadata.csv + the first sample_count images and build Flickr8kSample records.

sample_count defaults to 8. Pass None to load the full Flickr8k train split (~6000 rows, ~1 GB of JPEGs; slow on a cold cache).

Distributed safety: the hf_hub_download calls run only on global rank 0. Non-zero ranks wait on a torch.distributed.barrier until rank 0 has populated the cache, then read the same files from disk. This avoids the multi-process deadlock seen when N ranks race huggingface_hub’s Xet and _local_folder metadata locks against the same cache_dir (a Lustre / NFS shared filesystem).
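The rank-0-first flow and the sample_count slicing can be illustrated with a small, self-contained sketch. Everything here is hypothetical: build_samples, Sample, and the injected download, is_rank_0 and barrier callables are stand-ins, and the metadata.csv column name file_name (holding the full image file name) is an assumption. The demo uses a temporary directory in place of the Hub, so no network access is involved.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Sample:
    image_path: str
    caption: str


def build_samples(
    download: Callable[[str], Path],  # hypothetical: repo-relative name -> local path
    *,
    split: str = "train",
    sample_count: Optional[int] = 8,
    caption_key: str = "caption_0",
    is_rank_0: Callable[[], bool] = lambda: True,
    barrier: Callable[[], None] = lambda: None,
) -> list[Sample]:
    def load_rows() -> list[dict]:
        path = download(f"{split}/metadata.csv")
        with open(path, newline="", encoding="utf-8") as fh:
            # rows[:None] keeps every row, so None means "the full split".
            return list(csv.DictReader(fh))[:sample_count]

    if is_rank_0():
        # Only rank 0 populates the cache.
        for row in load_rows():
            download(f"{split}/{row['file_name']}")
    barrier()  # other ranks wait here until the cache is warm

    # After the barrier every rank resolves the same files (cache hits).
    return [
        Sample(str(download(f"{split}/{row['file_name']}")), row[caption_key])
        for row in load_rows()
    ]


# Demo against a local folder that plays the role of the remote repo.
root = Path(tempfile.mkdtemp())
(root / "train").mkdir()
(root / "train" / "metadata.csv").write_text(
    "file_name,caption_0\na.jpg,A dog runs.\nb.jpg,A child smiles.\nc.jpg,Two birds.\n",
    encoding="utf-8",
)
first_two = build_samples(lambda rel: root / rel, sample_count=2)
everything = build_samples(lambda rel: root / rel, sample_count=None)
```

Injecting the rank check and the barrier keeps the sketch runnable in a single process; in a distributed job they would be replaced by helpers such as _is_global_rank_0 and _maybe_barrier.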