nemo_curator.tasks.document

Module Contents

Classes

Name	Description
`DocumentBatch`	Task for processing batches of text documents.

API

class nemo_curator.tasks.document.DocumentBatch(
    task_id: str,
    dataset_name: str,
    data: pyarrow.Table | pandas.DataFrame = pa.Table(),
    _stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
    _metadata: dict[str, typing.Any] = dict()
)

Dataclass

Bases: Task[Table | DataFrame]

Task for processing batches of text documents. Documents are stored in tabular form, as either a PyArrow Table or a pandas DataFrame.

data

Table | DataFrame = field(default_factory=pa.Table)
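
The `default_factory` form means the default is produced by calling the factory for each new instance, so instances do not share one mutable default. The following is a minimal sketch of that dataclass pattern using only the standard library. `ExampleTask` is a hypothetical stand-in, not part of `nemo_curator`, and it uses a plain list where `DocumentBatch` holds a table.

```python
from dataclasses import dataclass, field
from typing import Any


# Hypothetical stand-in for illustration only; not the nemo_curator class.
@dataclass
class ExampleTask:
    task_id: str
    dataset_name: str
    # default_factory is called once per instance, so each task
    # gets its own container instead of a shared default object.
    data: list[dict[str, Any]] = field(default_factory=list)
    _metadata: dict[str, Any] = field(default_factory=dict)


first = ExampleTask(task_id="t1", dataset_name="demo")
second = ExampleTask(task_id="t2", dataset_name="demo")

# Mutating one instance's defaults leaves the other untouched.
first.data.append({"text": "hello"})
first._metadata["source"] = "example"
```

Writing `data: list = []` directly would be rejected by `dataclasses` with a `ValueError`, which is why mutable defaults are declared through `field(default_factory=...)`.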

num_items

int

Get the number of documents in this batch.

nemo_curator.tasks.document.DocumentBatch.get_columns() -> list[str]

Get column names from the data.

nemo_curator.tasks.document.DocumentBatch.to_pandas() -> pandas.DataFrame

Convert the data to a pandas DataFrame.

nemo_curator.tasks.document.DocumentBatch.to_pyarrow() -> pyarrow.Table

Convert the data to a PyArrow Table.

nemo_curator.tasks.document.DocumentBatch.validate() -> bool

Validate the task data.