nemo_curator.stages.image.io.image_writer


Module Contents

Classes

| Name | Description |
| --- | --- |
| ImageWriterStage | Write images to tar files and corresponding metadata to a Parquet file. |

API

class nemo_curator.stages.image.io.image_writer.ImageWriterStage(
output_dir: str,
images_per_tar: int = 1000,
verbose: bool = False,
deterministic_name: bool = True,
remove_image_data: bool = False,
name: str = 'image_writer'
)
Dataclass

Bases: ProcessingStage[ImageBatch, FileGroupTask]

Write images to tar files and corresponding metadata to a Parquet file.

  • Images are packed into tar archives with at most images_per_tar entries each.
  • Metadata for all written images in the batch is stored in a single Parquet file.
  • Tar filenames are unique across actors via an actor-scoped prefix.
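
The `images_per_tar` cap described above can be pictured as simple slicing of the batch. The following is a minimal sketch, not the stage's actual code: `chunk_for_tars` and the sample `image_ids` are hypothetical names introduced only for illustration.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunk_for_tars(items: Sequence[T], images_per_tar: int = 1000) -> Iterator[Sequence[T]]:
    """Yield consecutive slices holding at most images_per_tar items each."""
    if images_per_tar < 1:
        raise ValueError("images_per_tar must be at least 1")
    for start in range(0, len(items), images_per_tar):
        yield items[start:start + images_per_tar]


# Hypothetical batch of 2500 image ids, split with the default cap of 1000.
image_ids = [f"img_{i:04d}" for i in range(2500)]
groups = list(chunk_for_tars(image_ids, images_per_tar=1000))
sizes = [len(group) for group in groups]  # [1000, 1000, 500]
```

Under this assumption, a batch smaller than `images_per_tar` yields a single tar, and only the last tar of a larger batch is partially filled.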

deterministic_name: bool = True
images_per_tar: int = 1000
name: str = 'image_writer'
output_dir: str
remove_image_data: bool = False
verbose: bool = False
nemo_curator.stages.image.io.image_writer.ImageWriterStage.__post_init__() -> None
nemo_curator.stages.image.io.image_writer.ImageWriterStage._encode_image_to_bytes(
image: numpy.ndarray
) -> tuple[bytes, str]

Encode an image array to JPEG bytes. Always returns (bytes, ".jpg").

nemo_curator.stages.image.io.image_writer.ImageWriterStage._write_parquet(
base_name: str,
rows: list[dict[str, typing.Any]]
) -> str

Write metadata rows to a Parquet file for a specific tar and return its path.

The Parquet file shares the same base name as the tar file: {base_name}.parquet.
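
The shared base name means either file can be located from the other. A small sketch of that pairing follows; `sibling_paths`, the `.tar` suffix, and the example directory are assumptions made for illustration, not part of the documented API.

```python
from pathlib import Path


def sibling_paths(output_dir: str, base_name: str) -> tuple[Path, Path]:
    """Return the (tar, parquet) paths that share one base name."""
    root = Path(output_dir)
    return root / f"{base_name}.tar", root / f"{base_name}.parquet"


# Hypothetical output directory and base name.
tar_path, parquet_path = sibling_paths("/data/out", "images-000000")

# Swapping the suffix on the tar path recovers the metadata path.
recovered = tar_path.with_suffix(".parquet")
```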

nemo_curator.stages.image.io.image_writer.ImageWriterStage._write_tar(
base_name: str,
members: list[tuple[str, bytes]]
) -> str

Write a tar file containing the given (member_name, bytes) entries, using the provided base name.

Returns the tar path.
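
Writing in-memory `(member_name, bytes)` pairs into a tar archive can be done with the standard-library `tarfile` module. The sketch below shows one way to do it and is not the method's actual body: the free function `write_tar`, its explicit `output_dir` argument, and the `.tar` suffix are assumptions.

```python
import io
import os
import tarfile
import tempfile


def write_tar(output_dir: str, base_name: str, members: list[tuple[str, bytes]]) -> str:
    """Write each (member_name, payload) pair into {base_name}.tar and return its path."""
    tar_path = os.path.join(output_dir, f"{base_name}.tar")
    with tarfile.open(tar_path, "w") as tar:
        for member_name, payload in members:
            info = tarfile.TarInfo(name=member_name)
            info.size = len(payload)  # size must be set before adding a file object
            tar.addfile(info, io.BytesIO(payload))
    return tar_path


# Round trip with placeholder payloads (not real JPEG data).
with tempfile.TemporaryDirectory() as tmp:
    path = write_tar(tmp, "images-000000", [("a.jpg", b"\xff\xd8fake"), ("b.jpg", b"\xff\xd8data")])
    with tarfile.open(path) as tar:
        names = tar.getnames()
        first = tar.extractfile("a.jpg").read()
```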

nemo_curator.stages.image.io.image_writer.ImageWriterStage.construct_base_name(
task: nemo_curator.tasks.image.ImageBatch
) -> str

Construct a base name for tar files within this actor.
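
The reference does not spell out the naming scheme. Purely as an illustration of how an actor-scoped prefix and a `deterministic_name`-style switch could interact, here is a hypothetical sketch: `make_base_name`, the `images-` prefix, the `task_id` and `actor_id` strings, and the use of SHA-256 are all assumptions.

```python
import hashlib
import uuid


def make_base_name(task_id: str, actor_id: str, deterministic: bool = True) -> str:
    """Build a base name scoped to one actor (hypothetical scheme)."""
    if deterministic:
        # The same task id always maps to the same suffix.
        suffix = hashlib.sha256(task_id.encode("utf-8")).hexdigest()[:12]
    else:
        # A random suffix differs between calls.
        suffix = uuid.uuid4().hex[:12]
    return f"images-{actor_id}-{suffix}"


a = make_base_name("batch-42", actor_id="actor0")
b = make_base_name("batch-42", actor_id="actor0")
c = make_base_name("batch-42", actor_id="actor1")
```

In this sketch, `a` and `b` are identical because the name is derived from the task id, while `c` differs because the actor prefix differs.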

nemo_curator.stages.image.io.image_writer.ImageWriterStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.image.io.image_writer.ImageWriterStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.image.io.image_writer.ImageWriterStage.process(
task: nemo_curator.tasks.image.ImageBatch
) -> nemo_curator.tasks.file_group.FileGroupTask