nemo_curator.stages.interleaved.utils.materialization
Module Contents
Classes
Functions
Data
API
Bases: NamedTuple
direct_read
missing
range_read
tar_extract
Partition pending image rows into three I/O strategy groups, plus a missing group for rows with no usable path.
- tar_extract: has member name but no byte_offset (must open tar and extractfile)
- range_read: has member + byte_offset + byte_size (can use fs.cat_ranges)
- direct_read: no member (path is the file itself)
- missing: path is None/NaN
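The grouping rules above can be sketched as a small classifier. This is an illustration only, not the module's implementation: the `Groups` class, the `classify` function, and the dict-shaped rows with `path`, `member`, `byte_offset`, and `byte_size` keys are all assumptions made for the example. Rows with a member and only one of the two byte fields are treated as tar_extract here, which is also an assumption.

```python
import math
from typing import Any, NamedTuple


class Groups(NamedTuple):
    """Hypothetical stand-in for the documented NamedTuple (same field names)."""
    direct_read: list[int]
    missing: list[int]
    range_read: list[int]
    tar_extract: list[int]


def _is_null(value: Any) -> bool:
    # Treat both None and float NaN as "absent".
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify(rows: list[dict[str, Any]]) -> Groups:
    """Assign each row index to exactly one group, following the rules listed above."""
    groups = Groups([], [], [], [])
    for idx, ref in enumerate(rows):
        if _is_null(ref.get("path")):
            groups.missing.append(idx)        # no path at all
        elif _is_null(ref.get("member")):
            groups.direct_read.append(idx)    # path is the file itself
        elif _is_null(ref.get("byte_offset")) or _is_null(ref.get("byte_size")):
            groups.tar_extract.append(idx)    # member known, byte range unknown
        else:
            groups.range_read.append(idx)     # member + byte_offset + byte_size
    return groups
```

Checking the missing-path case first keeps the later branches free of null-path handling.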
Extract a single frame from a multi-frame TIFF, returning it as a single-frame TIFF.
Returns the raw bytes unchanged if the data is not a TIFF.
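The pass-through behaviour for non-TIFF data can be illustrated with a magic-byte check. This is a minimal sketch under stated assumptions: `looks_like_tiff` and `extract_frame_or_passthrough` are hypothetical names, only the classic TIFF signatures are checked, and the actual frame re-encoding is delegated to a caller-supplied `extract_frame` callback rather than shown.

```python
from typing import Callable

# Classic TIFF signatures: little-endian ("II") and big-endian ("MM") byte order.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(data: bytes) -> bool:
    """Return True if the first four bytes match a classic TIFF header."""
    return data[:4] in TIFF_MAGICS


def extract_frame_or_passthrough(
    data: bytes,
    frame_index: int,
    extract_frame: Callable[[bytes, int], bytes],  # hypothetical callback
) -> bytes:
    """Return data unchanged unless it is a TIFF, in which case delegate to extract_frame."""
    if not looks_like_tiff(data):
        return data
    return extract_frame(data, frame_index)
```

In practice the callback would likely be backed by an imaging library; which one the module uses is not stated here.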
Read each direct file once and share its bytes across all rows that reference it.
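A rough sketch of the read-once idea, assuming local files and a hypothetical `read_direct` helper that takes a mapping from row id to path (the real function presumably goes through a filesystem abstraction instead):

```python
from collections import defaultdict
from pathlib import Path


def read_direct(paths_by_row: dict[int, str]) -> dict[int, bytes]:
    """Read every distinct path once and hand the same bytes object to each row."""
    rows_by_path: dict[str, list[int]] = defaultdict(list)
    for row_id, path in paths_by_row.items():
        rows_by_path[path].append(row_id)

    out: dict[int, bytes] = {}
    for path, row_ids in rows_by_path.items():
        data = Path(path).read_bytes()  # one read per unique path
        for row_id in row_ids:
            out[row_id] = data          # shared reference, no copy
    return out
```

Because `bytes` is immutable, sharing one object across rows is safe and avoids duplicating large payloads in memory.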
Batch byte-range reads per path using fs.cat_ranges(), deduplicating identical ranges.
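The deduplicate-then-batch pattern might look like the following. It is a sketch, not the module's code: `batched_range_read` and the `(path, byte_offset, byte_size)` request tuples are assumptions, and `fs` is any object exposing `cat_ranges(paths, starts, ends)` in the style of fsspec filesystems.

```python
from collections import defaultdict


def batched_range_read(fs, requests: dict[int, tuple[str, int, int]]) -> dict[int, bytes]:
    """requests maps row id -> (path, byte_offset, byte_size); returns row id -> bytes."""
    # Collect each distinct (offset, size) once per path.
    ranges_by_path: dict[str, list[tuple[int, int]]] = defaultdict(list)
    seen: set[tuple[str, int, int]] = set()
    for key in requests.values():
        if key not in seen:
            seen.add(key)
            path, offset, size = key
            ranges_by_path[path].append((offset, size))

    # One cat_ranges call per path; end offsets are exclusive (offset + size).
    cache: dict[tuple[str, int, int], bytes] = {}
    for path, ranges in ranges_by_path.items():
        starts = [offset for offset, _ in ranges]
        ends = [offset + size for offset, size in ranges]
        blobs = fs.cat_ranges([path] * len(ranges), starts, ends)
        for (offset, size), blob in zip(ranges, blobs):
            cache[(path, offset, size)] = blob

    # Fan the deduplicated results back out to every requesting row.
    return {row_id: cache[key] for row_id, key in requests.items()}
```

Rows that point at the same byte range receive the same bytes object, so each range is fetched only once.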
Open each tar once and extract all needed members sequentially.
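As an illustration of the open-once approach, the sketch below uses the standard-library `tarfile` module on local paths. The `extract_members` name and the `(tar_path, member_name)` request shape are assumptions for the example; the real function may open tars through a remote filesystem.

```python
import tarfile
from collections import defaultdict


def extract_members(requests: dict[int, tuple[str, str]]) -> dict[int, bytes]:
    """requests maps row id -> (tar_path, member_name); returns row id -> member bytes."""
    # tar_path -> member_name -> row ids that need it
    wanted: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for row_id, (tar_path, member) in requests.items():
        wanted[tar_path][member].append(row_id)

    out: dict[int, bytes] = {}
    for tar_path, members in wanted.items():
        with tarfile.open(tar_path, "r") as tar:   # each archive is opened once
            for info in tar:                       # single sequential pass
                row_ids = members.get(info.name)
                if not row_ids or not info.isfile():
                    continue
                data = tar.extractfile(info).read()
                for row_id in row_ids:
                    out[row_id] = data
    return out
```

Rows whose member is not found in the archive are simply absent from the result in this sketch; how the real function reports them is not stated here.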
Distribute deduplicated range-read results, extracting TIFF frames as needed.
Return a task with image-row binary content materialized from source_ref.
Dispatches to three I/O strategies based on source_ref contents:
- range_read: byte_offset + byte_size present -> batched fs.cat_ranges()
- tar_extract: member present, no byte range -> open tar + extractfile
- direct_read: no member -> read file directly