nemo_curator.stages.client_partitioning

Module Contents

Classes

Name | Description
ClientPartitioningStage | Stage that partitions input file paths from a client into FileGroupTasks.

Functions

Name | Description
_read_list_json_rel | Read JSON list (via fsspec) and return entries relative to root.

API

class nemo_curator.stages.client_partitioning.ClientPartitioningStage(
file_paths: str | list[str],
files_per_partition: int | None = None,
blocksize: int | str | None = None,
file_extensions: list[str] | None = None,
storage_options: dict[str, typing.Any] | None = None,
limit: int | None = None,
name: str = 'client_partitioning',
input_list_json_path: str | None = None
)
Dataclass

Bases: FilePartitioningStage

Stage that partitions input file paths from a client into FileGroupTasks.

This stage runs as a dedicated processing stage rather than on the driver, and creates file groups according to the partitioning strategy.
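As a rough illustration of what count-based partitioning can look like, the sketch below splits a list of paths into consecutive groups of at most `files_per_partition` entries. It is not the library's implementation: the helper name `group_by_count` and the sample paths are hypothetical, the actual strategy is inherited from FilePartitioningStage and may differ, and `blocksize`-based partitioning is not shown.

```python
from typing import List


def group_by_count(paths: List[str], files_per_partition: int) -> List[List[str]]:
    """Hypothetical helper: split paths into consecutive groups.

    Each group holds at most files_per_partition paths; the last group
    may be smaller. This only illustrates the idea of a file group.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    return [
        paths[i : i + files_per_partition]
        for i in range(0, len(paths), files_per_partition)
    ]


# Five made-up paths grouped two at a time.
groups = group_by_count(
    ["a.jsonl", "b.jsonl", "c.jsonl", "d.jsonl", "e.jsonl"], 2
)
```

In the real stage each group would be wrapped in a FileGroupTask; here the groups are plain lists so the example stays self-contained.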

_fs: AbstractFileSystem | None = field(default=None, init=False, repr=False)
_root: str | None = field(default=None, init=False, repr=False)
input_list_json_path: str | None = None
name: str = 'client_partitioning'
nemo_curator.stages.client_partitioning.ClientPartitioningStage._list_relative() -> list[str]

Return sorted, de-duplicated list of paths relative to root.
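The behaviour described above can be sketched with plain string handling. This is a minimal sketch under stated assumptions, not the method's actual body: the function name `list_relative`, the POSIX-style sample paths, and the choice to skip paths outside the root are all assumptions made for illustration.

```python
from typing import Iterable, List


def list_relative(root: str, paths: Iterable[str]) -> List[str]:
    """Hypothetical sketch: sorted, de-duplicated paths relative to root."""
    prefix = root.rstrip("/") + "/"
    # A set removes duplicates; paths outside root are skipped here
    # (an assumption, the real method may handle them differently).
    relative = {p[len(prefix):] for p in paths if p.startswith(prefix)}
    return sorted(relative)


listing = list_relative(
    "/data/in",
    [
        "/data/in/b.jsonl",
        "/data/in/a.jsonl",
        "/data/in/b.jsonl",  # duplicate entry
        "/data/in/sub/c.jsonl",
    ],
)
```

Sorting the result gives a stable order, so the same listing yields the same groups when it is partitioned.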

nemo_curator.stages.client_partitioning.ClientPartitioningStage.process(
_: nemo_curator.tasks._EmptyTask
) -> list[nemo_curator.tasks.FileGroupTask]
nemo_curator.stages.client_partitioning.ClientPartitioningStage.setup(
worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.client_partitioning._read_list_json_rel(
root: str,
json_url: str,
storage_options: dict[str, typing.Any]
) -> list[str]

Read a JSON list (via fsspec) and return its entries relative to root. Validates that each entry is under root.
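The validation step can be sketched as follows. To keep the example self-contained it parses JSON text that is already in memory instead of opening `json_url` through fsspec, and it assumes POSIX-style paths. The function name `entries_relative_to_root`, the treatment of non-absolute entries as relative to root, and the use of `ValueError` are assumptions, not the documented behaviour.

```python
import json
import posixpath
from typing import List


def entries_relative_to_root(root: str, json_text: str) -> List[str]:
    """Hypothetical sketch: parse a JSON list and return root-relative entries."""
    entries = json.loads(json_text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of paths")

    root_norm = posixpath.normpath(root)
    relative: List[str] = []
    for entry in entries:
        if not isinstance(entry, str):
            raise ValueError(f"non-string entry: {entry!r}")
        # Assumption: entries that are not absolute are relative to root.
        candidate = entry if entry.startswith("/") else posixpath.join(root_norm, entry)
        # normpath collapses ".." so an entry cannot escape root unnoticed.
        full = posixpath.normpath(candidate)
        if not full.startswith(root_norm + "/"):
            raise ValueError(f"{entry!r} is not under {root!r}")
        relative.append(full[len(root_norm) + 1:])
    return relative


result = entries_relative_to_root(
    "/data/in", '["/data/in/b.jsonl", "sub/a.jsonl"]'
)
```

Normalizing before the prefix check matters: without it, an entry such as `/data/in/../secret/x.jsonl` would pass a naive `startswith` test while pointing outside the root.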