nemo_curator.stages.interleaved.utils.materialization

Module Contents

Classes

Name                             Description
_ClassifiedRows                  -

Functions

Name                             Description
_build_image_mask                -
_classify_rows                   Partition pending image rows into three I/O strategy groups, plus a group for rows with no path.
_extract_tiff_frame              Extract a single frame from a multi-frame TIFF, returning it as a single-frame TIFF.
_fill_direct_read_rows           Read each direct file once, share bytes across all rows referencing it.
_fill_materialized_bytes         -
_fill_range_read_rows            Batch byte-range reads per path using fs.cat_ranges(), deduplicating identical ranges.
_fill_tar_extract_rows           Open each tar once and extract all needed members sequentially.
_get_frame_index                 -
_init_materialization_buffers    -
_read_direct_file                -
_scatter_range_blobs             Distribute deduplicated range-read results, extracting TIFF frames as needed.
_task_with_dataframe             -
materialize_task_binary_content  Return a task with image-row binary content materialized from source_ref.

Data

_TAR_EXTENSIONS

API

class nemo_curator.stages.interleaved.utils.materialization._ClassifiedRows()

Bases: NamedTuple

direct_read: dict[str, list[int]]
missing: list[int]
range_read: dict[str, list[tuple[int, str, int, int, int | None]]]
tar_extract: dict[str, list[tuple[int, str, int | None]]]

nemo_curator.stages.interleaved.utils.materialization._build_image_mask(
df: pandas.DataFrame,
only_missing_binary: bool,
image_content_types: tuple[str, ...] | None
) -> pandas.Series

nemo_curator.stages.interleaved.utils.materialization._classify_rows(
df: pandas.DataFrame,
image_mask: pandas.Series
) -> nemo_curator.stages.interleaved.utils.materialization._ClassifiedRows

Partition pending image rows into three I/O strategy groups, plus a missing group for rows that have no path:

  • tar_extract: has member name but no byte_offset (must open tar and extractfile)
  • range_read: has member + byte_offset + byte_size (can use fs.cat_ranges)
  • direct_read: no member (path is the file itself)
  • missing: path is None/NaN
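
The grouping rules above can be sketched as follows. This is a simplified illustration, not the module's code: it takes a plain list of dicts instead of a DataFrame, and the key names (`path`, `member`, `byte_offset`, `byte_size`, `frame_index`), the tuple layouts, and the names `ClassifiedRows` and `classify_refs` are assumptions inferred from the field types of `_ClassifiedRows`.

```python
from collections import defaultdict
from typing import NamedTuple


class ClassifiedRows(NamedTuple):
    """Hypothetical stand-in mirroring the four fields of _ClassifiedRows."""
    tar_extract: dict
    range_read: dict
    direct_read: dict
    missing: list


def classify_refs(refs):
    """Group row indices by the I/O strategy each source reference calls for.

    Assumes each ref is None or a dict with optional keys "path", "member",
    "byte_offset", "byte_size" and "frame_index" (assumed names).
    """
    tar_extract = defaultdict(list)
    range_read = defaultdict(list)
    direct_read = defaultdict(list)
    missing = []
    for idx, ref in enumerate(refs):
        path = (ref or {}).get("path")
        # NaN is the only value that is not equal to itself.
        if path is None or path != path:
            missing.append(idx)
            continue
        member = ref.get("member")
        offset = ref.get("byte_offset")
        size = ref.get("byte_size")
        frame = ref.get("frame_index")
        if member is None:
            # No member: the path is the file itself.
            direct_read[path].append(idx)
        elif offset is not None and size is not None:
            # Member with a known byte range: eligible for a ranged read.
            range_read[path].append((idx, member, offset, size, frame))
        else:
            # Member without a byte range: the tar must be opened.
            tar_extract[path].append((idx, member, frame))
    return ClassifiedRows(
        dict(tar_extract), dict(range_read), dict(direct_read), missing
    )
```

Keying every group by path lets each later stage touch a given file or archive once, however many rows reference it.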
nemo_curator.stages.interleaved.utils.materialization._extract_tiff_frame(
tiff_bytes: bytes,
frame_index: int
) -> bytes | None

Extract a single frame from a multi-frame TIFF, returning it as a single-frame TIFF.

Returns the raw bytes unchanged if the data is not a TIFF.
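
How the function decides that data "is not a TIFF" is not documented here. One plausible check, sketched below, inspects the file's magic bytes: classic TIFF starts with `II` or `MM` followed by the number 42 in the matching byte order, and BigTIFF uses 43. The names `looks_like_tiff` and `extract_frame_or_passthrough`, and the `extract` callable, are hypothetical.

```python
# Byte-order mark plus version number (42 = classic TIFF, 43 = BigTIFF).
_TIFF_MAGICS = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")


def looks_like_tiff(data: bytes) -> bool:
    """Return True if the first four bytes match a TIFF or BigTIFF header."""
    return data[:4] in _TIFF_MAGICS


def extract_frame_or_passthrough(data, frame_index, extract):
    """Pass non-TIFF bytes through unchanged; otherwise defer to `extract`.

    `extract` is a hypothetical callable (data, frame_index) -> bytes | None
    standing in for the actual frame-extraction logic.
    """
    if not looks_like_tiff(data):
        return data
    return extract(data, frame_index)
```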

nemo_curator.stages.interleaved.utils.materialization._fill_direct_read_rows(
groups: dict[str, list[int]],
storage_options: dict[str, object],
binary_values: list[object],
error_values: list[str | None]
) -> None

Read each direct file once, share bytes across all rows referencing it.
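
A minimal sketch of the read-once, share-across-rows pattern, assuming `groups` maps each path to the row indices that reference it. The `read_file` parameter and the `read_local` helper are hypothetical. The real function takes `storage_options`, which suggests a remote-capable filesystem layer rather than the built-in `open` used here.

```python
def read_local(path):
    """Stand-in reader for local files only."""
    with open(path, "rb") as fh:
        return fh.read()


def fill_direct_reads(groups, binary_values, error_values, read_file=read_local):
    """Read each path once and assign the same bytes to every row using it."""
    for path, indices in groups.items():
        try:
            data = read_file(path)
        except OSError as exc:
            # One failed read marks every row that depends on this path.
            for idx in indices:
                error_values[idx] = f"failed to read {path}: {exc}"
            continue
        for idx in indices:
            binary_values[idx] = data
```

The buffers are mutated in place, which matches the `-> None` return type in the signature above.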

nemo_curator.stages.interleaved.utils.materialization._fill_materialized_bytes(
df: pandas.DataFrame,
image_mask: pandas.Series,
storage_options: dict[str, object],
binary_values: list[object],
error_values: list[str | None]
) -> None

nemo_curator.stages.interleaved.utils.materialization._fill_range_read_rows(
groups: dict[str, list[tuple[int, str, int, int, int | None]]],
storage_options: dict[str, object],
binary_values: list[object],
error_values: list[str | None]
) -> None

Batch byte-range reads per path using fs.cat_ranges(), deduplicating identical ranges.
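
The deduplication step might look like the sketch below, which collapses rows that share an (offset, size) pair into one range. The tuple layout `(idx, member, offset, size, frame_index)` and the name `plan_ranges` are assumptions based on the type of `groups`.

```python
def plan_ranges(rows):
    """Collapse rows sharing an (offset, size) pair into one range request.

    rows: list of (idx, member, offset, size, frame_index) tuples (assumed layout).
    Returns (range_keys, unique_ranges, starts, ends), with exclusive ends.
    """
    unique_ranges = {}
    for idx, member, offset, size, frame_index in rows:
        unique_ranges.setdefault((offset, size), []).append(
            (idx, member, frame_index)
        )
    # Dicts preserve insertion order, so keys line up with the returned blobs.
    range_keys = list(unique_ranges)
    starts = [offset for offset, _ in range_keys]
    ends = [offset + size for offset, size in range_keys]
    return range_keys, unique_ranges, starts, ends
```

With an fsspec filesystem `fs`, the batched read for one path would then be something like `fs.cat_ranges([path] * len(starts), starts, ends)`, which returns one blob per requested range.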

nemo_curator.stages.interleaved.utils.materialization._fill_tar_extract_rows(
groups: dict[str, list[tuple[int, str, int | None]]],
storage_options: dict[str, object],
binary_values: list[object],
error_values: list[str | None]
) -> None

Open each tar once and extract all needed members sequentially.
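
A minimal sketch of the open-once pattern using the standard-library `tarfile` module on local paths. It assumes each row is an `(idx, member, frame_index)` tuple and ignores `frame_index`, so any TIFF frame handling is omitted. The name `fill_tar_members` and the error message wording are hypothetical.

```python
import tarfile


def fill_tar_members(groups, binary_values, error_values):
    """Open each archive once and read every requested member from it."""
    for tar_path, rows in groups.items():
        try:
            # "r:*" lets tarfile detect compression (.tar, .tar.gz, .tgz).
            with tarfile.open(tar_path, "r:*") as tf:
                for idx, member, _frame_index in rows:
                    try:
                        fobj = tf.extractfile(member)
                    except KeyError:
                        fobj = None  # member name not in the archive
                    if fobj is None:
                        error_values[idx] = f"member not found: {member}"
                        continue
                    binary_values[idx] = fobj.read()
        except (OSError, tarfile.TarError) as exc:
            # Archive-level failure: flag rows that were not already resolved.
            for idx, _member, _frame_index in rows:
                if binary_values[idx] is None and error_values[idx] is None:
                    error_values[idx] = f"failed to read {tar_path}: {exc}"
```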

nemo_curator.stages.interleaved.utils.materialization._get_frame_index(
df: pandas.DataFrame,
idx: int
) -> int | None

nemo_curator.stages.interleaved.utils.materialization._init_materialization_buffers(
df: pandas.DataFrame
) -> tuple[list[object], list[str | None]]

nemo_curator.stages.interleaved.utils.materialization._read_direct_file(
path: str,
storage_options: dict[str, object]
) -> bytes | None

nemo_curator.stages.interleaved.utils.materialization._scatter_range_blobs(
blobs: list[object],
range_keys: list[tuple[int, int]],
unique_ranges: dict[tuple[int, int], list[tuple[int, str, int | None]]],
binary_values: list[object],
error_values: list[str | None]
) -> None

Distribute deduplicated range-read results, extracting TIFF frames as needed.
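
The scatter step can be sketched as below, assuming `blobs[i]` is the result for `range_keys[i]` and that a failed range arrives as an exception object in place of bytes (as fsspec's `cat_ranges` does with `on_error="return"`). The `extract_frame` parameter is a hypothetical stand-in for the TIFF frame extraction, and the name `scatter_blobs` is illustrative.

```python
def scatter_blobs(blobs, range_keys, unique_ranges, binary_values,
                  error_values, extract_frame=None):
    """Copy each fetched blob to every row that requested that range."""
    for key, blob in zip(range_keys, blobs):
        for idx, member, frame_index in unique_ranges[key]:
            if isinstance(blob, Exception):
                # A failed range fails every row that shares it.
                error_values[idx] = f"range read failed for {member}: {blob}"
                continue
            data = bytes(blob)
            if frame_index is not None and extract_frame is not None:
                data = extract_frame(data, frame_index)
                if data is None:
                    error_values[idx] = f"frame {frame_index} not found in {member}"
                    continue
            binary_values[idx] = data
```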

nemo_curator.stages.interleaved.utils.materialization._task_with_dataframe(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> nemo_curator.tasks.InterleavedBatch

nemo_curator.stages.interleaved.utils.materialization.materialize_task_binary_content(
task: nemo_curator.tasks.InterleavedBatch,
io_kwargs: dict[str, object] | None = None,
only_missing_binary: bool = True,
image_content_types: tuple[str, ...] | None = None
) -> nemo_curator.tasks.InterleavedBatch

Return a task with image-row binary content materialized from source_ref.

Dispatches to three I/O strategies based on source_ref contents:

  • range_read: byte_offset + byte_size present -> batched fs.cat_ranges()
  • tar_extract: member present, no byte range -> open tar + extractfile
  • direct_read: no member -> read file directly

nemo_curator.stages.interleaved.utils.materialization._TAR_EXTENSIONS = ('.tar', '.tar.gz', '.tgz')