bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader#
Flickr8k (intro/flickr8k) sample loader — ported verbatim from
playground/data/sft/step37/flickr8k_sft_data.py.
Downloads train/metadata.csv and the per-row train/<file_name>.jpg
images via huggingface_hub.hf_hub_download (no transformers
involved). Output: a list of Flickr8kSample records, which is then wrapped in
Step37Flickr8kDataset for the tokenize step.
Module Contents#
Classes#
Flickr8kSample | Image-caption sample from the Flickr8k dataset.
Step37Flickr8kDataset | Step3.7 SFT dataset over the CC0 Flickr8k image-caption data.
Functions#
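get_flickr8k_dataset_file | Download (or reuse cached) a single Flickr8k file via hf_hub_download.
_is_global_rank_0 | Return True on global rank 0 (or when torch.distributed isn’t initialized).
_maybe_barrier | torch.distributed.barrier() if process group is up, else no-op.
prepare_flickr8k_samples | Download metadata.csv + the first sample_count images and build Flickr8kSample records.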
API#
- class bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample#
Image-caption sample from the Flickr8k dataset.
- image_path: str#
None
- caption: str#
None
- class bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Step37Flickr8kDataset(
- samples: list[bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample],
- template,
- prompt: str,
)#
Step3.7 SFT dataset over the CC0 Flickr8k image-caption data.
Map-style torch.utils.data.Dataset (no inheritance; duck-typed): __len__ returns len(samples), and __getitem__(idx) returns the tokenized MultimodalSFTSample.
- __len__() → int#
- __getitem__(idx: int)#
- static _to_dialog(
- sample: bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.Flickr8kSample,
- prompt: str,
)#
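The map-style, duck-typed contract described for Step37Flickr8kDataset can be illustrated with a minimal sketch. Everything named here (CaptionSample, DuckTypedCaptionDataset, toy_encode) is a hypothetical stand-in, and the real class tokenizes through its template argument rather than through a plain callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class CaptionSample:
    """Hypothetical stand-in for Flickr8kSample."""

    image_path: str
    caption: str


class DuckTypedCaptionDataset:
    """Map-style dataset that does not subclass torch.utils.data.Dataset."""

    def __init__(
        self,
        samples: List[CaptionSample],
        encode: Callable[[CaptionSample, str], Dict[str, Any]],
        prompt: str,
    ) -> None:
        self.samples = samples
        self.encode = encode  # assumption: plays the role of the template/tokenizer
        self.prompt = prompt

    def __len__(self) -> int:
        # Same contract as the documented class: the number of raw samples.
        return len(self.samples)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        # Encoding happens lazily, one sample per index lookup.
        return self.encode(self.samples[idx], self.prompt)


def toy_encode(sample: CaptionSample, prompt: str) -> Dict[str, Any]:
    # Placeholder for real tokenization: pair the prompt with the caption.
    return {"image": sample.image_path, "text": f"{prompt} {sample.caption}"}


dataset = DuckTypedCaptionDataset(
    [CaptionSample("a.jpg", "a dog runs"), CaptionSample("b.jpg", "two kids play")],
    toy_encode,
    "Describe the image.",
)
```

Because only __len__ and __getitem__ are involved, the object can be indexed like any map-style dataset without importing torch in the loader module.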
- bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.get_flickr8k_dataset_file(
- *,
- repo_id: str,
- filename: str,
- cache_dir: pathlib.Path,
)#
Download (or reuse cached) a single Flickr8k file via hf_hub_download.
The download call is not multi-process safe: Hugging Face’s Xet client and _local_folder.read/write_download_metadata use per-file metadata locks that deadlock when N ranks race against the same cache_dir. Callers must serialise concurrent invocations; running the download on rank 0 only, followed by torch.distributed.barrier(), is the standard pattern (see prepare_flickr8k_samples).
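The serialisation requirement can be sketched without any network access. In the snippet below, fetch_once and fake_download are hypothetical names, the rank flag and barrier are injected as plain arguments, and two "ranks" are simulated sequentially in one process; it illustrates the pattern and is not the module's actual code.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_once(
    target: Path,
    download: Callable[[Path], None],
    *,
    is_rank_0: bool,
    barrier: Callable[[], None],
) -> Path:
    """Let only rank 0 download; every rank returns the same cached path."""
    if is_rank_0 and not target.exists():
        download(target)  # the only place that touches the download locks
    barrier()  # all ranks wait here until rank 0 has finished
    if not target.exists():
        raise FileNotFoundError(f"rank 0 did not populate {target}")
    return target


calls: List[str] = []


def fake_download(path: Path) -> None:
    # Stand-in for hf_hub_download: records the call and writes a dummy file.
    calls.append(path.name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b"jpeg bytes")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "train" / "example.jpg"
    no_op = lambda: None  # single process, so nothing to synchronise
    rank0_path = fetch_once(target, fake_download, is_rank_0=True, barrier=no_op)
    rank1_path = fetch_once(target, fake_download, is_rank_0=False, barrier=no_op)
    payload = rank1_path.read_bytes()
```

In a real job the barrier argument would be a collective such as torch.distributed.barrier, so non-zero ranks block until the file exists instead of racing for the metadata locks.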
- bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader._is_global_rank_0() → bool#
Return True on global rank 0 (or when torch.distributed isn’t initialized).
- bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader._maybe_barrier() → None#
torch.distributed.barrier() if process group is up, else no-op.
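The two helpers above might look roughly like the following. This is a sketch under assumptions: the function names drop the leading underscore to mark them as illustrations, and the guarded import is added only so the snippet also runs where torch is not installed; the module itself may import torch unconditionally.

```python
try:
    import torch.distributed as dist
except ImportError:  # assumption: treat a missing torch as single-process
    dist = None


def _dist_ready() -> bool:
    # True only when a process group has actually been initialised.
    return dist is not None and dist.is_available() and dist.is_initialized()


def is_global_rank_0() -> bool:
    # Without a process group there is a single process, which counts as rank 0.
    return not _dist_ready() or dist.get_rank() == 0


def maybe_barrier() -> None:
    # Synchronise all ranks if distributed is up; otherwise do nothing.
    if _dist_ready():
        dist.barrier()
```

Checking is_initialized() before get_rank() or barrier() matters because both calls fail when no default process group exists, which is the normal situation in a single-process run.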
- bridge.data.vlm_datasets.step37_flickr8k.flickr8k_loader.prepare_flickr8k_samples(
- *,
- repo_id: str = 'intro/flickr8k',
- split: str = 'train',
- sample_count: int | None = 8,
- caption_key: str = 'caption_0',
- cache_dir: str = '.cache/step37_flickr8k',
)#
Download metadata.csv + the first sample_count images and build Flickr8kSample records.
sample_count defaults to 8. Pass None to take the full Flickr8k train split (~6000 rows, ~1 GB jpgs, slow on a cold cache).
Distributed-safety: the actual hf_hub_download calls only run on global rank 0; non-zero ranks wait on a torch.distributed.barrier until rank 0 has populated the cache, then they read the same files from disk. This avoids the multi-process deadlock seen when N ranks race huggingface_hub’s Xet + _local_folder.metadata locks against the same cache_dir (Lustre / NFS shared filesystem).
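The row-selection step can be sketched with the standard library alone. The snippet assumes that metadata.csv has a file_name column holding each image's file name plus one column per caption (caption_0 being the default caption_key); CaptionSample, build_samples and the inline CSV text are hypothetical, and no download is performed.

```python
import csv
import io
from dataclasses import dataclass
from itertools import islice
from pathlib import Path
from typing import List, Optional


@dataclass
class CaptionSample:
    """Hypothetical stand-in for Flickr8kSample."""

    image_path: str
    caption: str


def build_samples(
    metadata_csv: str,
    image_dir: Path,
    *,
    sample_count: Optional[int] = 8,
    caption_key: str = "caption_0",
) -> List[CaptionSample]:
    rows = csv.DictReader(io.StringIO(metadata_csv))
    # islice(rows, None) yields every row, mirroring sample_count=None.
    return [
        CaptionSample(str(image_dir / row["file_name"]), row[caption_key])
        for row in islice(rows, sample_count)
    ]


# Made-up metadata used only to exercise the function.
METADATA = (
    "file_name,caption_0,caption_1\n"
    "0.jpg,A dog runs.,Dog running.\n"
    "1.jpg,Two kids play.,Children playing.\n"
    "2.jpg,A man surfs.,Surfer on a wave.\n"
)

first_two = build_samples(METADATA, Path("train"), sample_count=2)
everything = build_samples(METADATA, Path("train"), sample_count=None)
```

Limiting the rows before any image is fetched keeps the default run small: only the first sample_count images need to be downloaded on a cold cache.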