nemo_curator.stages.image.io.image_reader

Module Contents

Classes

Name | Description
ImageReaderStage | DALI-based reader that loads images from WebDataset tar shards.

API

class nemo_curator.stages.image.io.image_reader.ImageReaderStage(
dali_batch_size: int = 100,
verbose: bool = True,
num_threads: int = 8,
num_gpus_per_worker: float = 0.25,
name: str = 'image_reader'
)
Dataclass

Bases: ProcessingStage[FileGroupTask, ImageBatch]

DALI-based reader that loads images from WebDataset tar shards.

Works with DALI GPU (CUDA) or DALI CPU; decodes on GPU if CUDA is available, otherwise falls back to CPU decoding.
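
The fallback described above can be sketched as a small device-selection helper. This is an illustration only: `choose_decode_device` and `detect_cuda` are hypothetical names, and using `torch` to detect CUDA is an assumption, since this page does not say how the stage checks for a GPU. The strings `"mixed"` and `"cpu"` are the device values DALI's image decoder accepts for GPU-accelerated and host-only decoding.

```python
def choose_decode_device(cuda_available: bool) -> str:
    """Return a DALI decoder device string for the given environment.

    Hypothetical helper, not part of ImageReaderStage.
    """
    # "mixed" is DALI's hybrid backend: encoded bytes arrive on the CPU
    # and the decoded image is produced on the GPU.
    # "cpu" keeps decoding entirely on the host.
    return "mixed" if cuda_available else "cpu"


def detect_cuda() -> bool:
    """Best-effort CUDA check; a missing torch install counts as 'no CUDA'.

    Assumption: torch is used here only for illustration.
    """
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


device = choose_decode_device(detect_cuda())
print(f"decoding on: {device}")
```

Keeping the decision in a pure function of one boolean makes both branches easy to exercise on a machine without a GPU.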

dali_batch_size: int = 100
name: str = 'image_reader'
num_gpus_per_worker: float = 0.25
num_threads: int = 8
verbose: bool = True

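
Because the stage is a dataclass, its fields are plain keyword arguments with the defaults listed above. The sketch below uses a hypothetical stand-in, `ReaderSettings`, that mirrors those fields so it runs without NeMo Curator installed. It also assumes that `num_gpus_per_worker` is a fractional GPU share requested by each worker, as the name suggests; under that assumption the default of 0.25 would let four workers share one GPU.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReaderSettings:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    dali_batch_size: int = 100
    verbose: bool = True
    num_threads: int = 8
    num_gpus_per_worker: float = 0.25
    name: str = "image_reader"


def workers_per_gpu(settings: ReaderSettings) -> int:
    """Workers that fit on one GPU if each requests a fractional share.

    Returns 0 when a single worker asks for more than one whole GPU.
    """
    if settings.num_gpus_per_worker <= 0:
        raise ValueError("num_gpus_per_worker must be positive")
    return math.floor(1 / settings.num_gpus_per_worker)


defaults = ReaderSettings()
# Override only the fields that differ from the defaults.
larger_share = replace(defaults, dali_batch_size=32, num_gpus_per_worker=0.5)

print(workers_per_gpu(defaults))      # 4
print(workers_per_gpu(larger_share))  # 2
```
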
nemo_curator.stages.image.io.image_reader.ImageReaderStage.__post_init__() -> None

nemo_curator.stages.image.io.image_reader.ImageReaderStage._create_dali_pipeline(
tar_paths: list[str]
) -> object

nemo_curator.stages.image.io.image_reader.ImageReaderStage._read_tars_with_dali(
tar_paths: list[pathlib.Path]
) -> collections.abc.Generator[list[nemo_curator.tasks.ImageObject], None, None]

Yield lists of ImageObject per DALI run over one or more tar files.

nemo_curator.stages.image.io.image_reader.ImageReaderStage._stream_batches(
tar_files: list[pathlib.Path]
) -> collections.abc.Generator[nemo_curator.tasks.ImageBatch, None, None]

Emit one ImageBatch per DALI run across all provided tar files.
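
The two generators above follow a common pattern: an inner generator yields one list of images per run, and an outer generator wraps each list in a batch object. The sketch below shows that pattern with hypothetical stand-ins (`FakeImage`, `FakeBatch`, `read_in_runs`, `stream_batches`) in place of the real `ImageObject`, `ImageBatch`, and DALI pipeline. It assumes a run holds at most `batch_size` images and that a short final run is emitted as-is; the real reader may handle the last batch differently.

```python
from collections.abc import Generator, Iterable
from dataclasses import dataclass, field


@dataclass
class FakeImage:
    """Hypothetical stand-in for nemo_curator.tasks.ImageObject."""

    image_id: str


@dataclass
class FakeBatch:
    """Hypothetical stand-in for nemo_curator.tasks.ImageBatch."""

    data: list[FakeImage] = field(default_factory=list)


def read_in_runs(
    image_ids: Iterable[str], batch_size: int
) -> Generator[list[FakeImage], None, None]:
    """Yield one list of images per run of at most batch_size items."""
    run: list[FakeImage] = []
    for image_id in image_ids:
        run.append(FakeImage(image_id))
        if len(run) == batch_size:
            yield run
            run = []
    if run:
        # Emit the final, possibly shorter, run.
        yield run


def stream_batches(
    image_ids: Iterable[str], batch_size: int
) -> Generator[FakeBatch, None, None]:
    """Wrap each run in a batch object, one batch per run."""
    for run in read_in_runs(image_ids, batch_size):
        yield FakeBatch(data=run)


ids = (f"img{i}" for i in range(250))
sizes = [len(batch.data) for batch in stream_batches(ids, batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Because both layers are generators, only one run is held in memory at a time; a caller that needs a list, as `process` returns, can wrap the generator in `list(...)`.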

nemo_curator.stages.image.io.image_reader.ImageReaderStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.image.io.image_reader.ImageReaderStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.image.io.image_reader.ImageReaderStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> list[nemo_curator.tasks.ImageBatch]

nemo_curator.stages.image.io.image_reader.ImageReaderStage.ray_stage_spec() -> dict[str, typing.Any]

Ray stage specification for this stage.