nemo_curator.tasks.tasks


Module Contents

Classes

Name | Description
Task | Abstract base class for tasks in the pipeline.
_EmptyTask | Dummy task for testing.

Data

EmptyTask

T

API

class nemo_curator.tasks.tasks.Task(
task_id: str,
dataset_name: str,
data: nemo_curator.tasks.tasks.T,
_stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
_metadata: dict[str, typing.Any] = dict()
)
Dataclass, Abstract

Bases: Generic[T]

Abstract base class for tasks in the pipeline. A task represents a batch of data to be processed. Different modalities (text, audio, video) can implement their own task types.

Attributes:
task_id: Unique identifier for this task.
dataset_name: Name of the dataset this task belongs to.
dataframe_attribute: Name of the attribute that contains the dataframe data. Used for input/output validation.
_stage_perf: List of performance stats for the stages this task has passed through.

_metadata
dict[str, Any] = field(default_factory=dict)
_stage_perf
list[StagePerfStats] = field(default_factory=list)
_uuid
str
data
T
dataset_name
str
num_items
int

Get the number of items in this task.

task_id
str

nemo_curator.tasks.tasks.Task.__post_init__() -> None

Post-initialization hook.

nemo_curator.tasks.tasks.Task.__repr__() -> str

nemo_curator.tasks.tasks.Task.add_stage_perf(
perf_stats: nemo_curator.utils.performance_utils.StagePerfStats
) -> None

Add performance stats for a stage.
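
As a rough illustration of how per-stage stats could accumulate on a task, the sketch below uses stand-ins rather than the library itself. `PerfStatsSketch` and its `stage_name` and `process_time` fields are hypothetical and are not taken from the real `StagePerfStats`; appending to `_stage_perf` is likewise an assumption about what `add_stage_perf` does.

```python
from dataclasses import dataclass, field


@dataclass
class PerfStatsSketch:
    """Hypothetical stand-in for StagePerfStats; field names are assumptions."""

    stage_name: str
    process_time: float


@dataclass
class TaskWithPerf:
    """Stand-in that keeps only the _stage_perf list documented on Task."""

    task_id: str
    # default_factory gives every instance its own list.
    _stage_perf: list[PerfStatsSketch] = field(default_factory=list)

    def add_stage_perf(self, perf_stats: PerfStatsSketch) -> None:
        # Assumed behavior: record stats in the order the stages ran.
        self._stage_perf.append(perf_stats)


task = TaskWithPerf(task_id="batch-0")
task.add_stage_perf(PerfStatsSketch("read", 0.5))
task.add_stage_perf(PerfStatsSketch("filter", 0.25))

# Sum the time spent across all recorded stages.
total_time = sum(stats.process_time for stats in task._stage_perf)
```

Because the list comes from `default_factory`, two tasks created this way never share the same stats list.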

nemo_curator.tasks.tasks.Task.validate() -> bool
abstract

Validate the task data.
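
To illustrate how a modality-specific task might supply the abstract members, here is a minimal sketch that does not import `nemo_curator`. `TaskSketch` is a stand-in that mirrors only the fields and members documented above, and `LineBatchTask` is a hypothetical subclass. It assumes `num_items` is meant to be provided by subclasses (as `_EmptyTask` does), and it omits `__post_init__` because its behavior is not described here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Generic, TypeVar

T = TypeVar("T")


@dataclass
class TaskSketch(ABC, Generic[T]):
    """Stand-in for Task: mirrors the documented fields only (assumption)."""

    task_id: str
    dataset_name: str
    data: T
    _stage_perf: list[Any] = field(default_factory=list)
    _metadata: dict[str, Any] = field(default_factory=dict)

    @property
    @abstractmethod
    def num_items(self) -> int:
        """Number of items in this task."""

    @abstractmethod
    def validate(self) -> bool:
        """Validate the task data."""


@dataclass
class LineBatchTask(TaskSketch[list[str]]):
    """Hypothetical text task whose data is a list of lines."""

    @property
    def num_items(self) -> int:
        return len(self.data)

    def validate(self) -> bool:
        # Valid when every item is a non-empty string.
        return all(isinstance(line, str) and bool(line) for line in self.data)


task = LineBatchTask(task_id="batch-0", dataset_name="demo", data=["a", "b"])
```

Instantiating `TaskSketch` directly raises `TypeError`, since its abstract members are not implemented; only concrete subclasses such as `LineBatchTask` can be created.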

class nemo_curator.tasks.tasks._EmptyTask(
task_id: str,
dataset_name: str,
data: nemo_curator.tasks.tasks.T,
_stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
_metadata: dict[str, typing.Any] = dict()
)
Dataclass

Bases: Task[None]

Dummy task for testing.

num_items
int

nemo_curator.tasks.tasks._EmptyTask.validate() -> bool

Validate the task data.

nemo_curator.tasks.tasks.EmptyTask = _EmptyTask(task_id='empty', dataset_name='empty', data=None)
nemo_curator.tasks.tasks.T = TypeVar('T')