nemo_curator.tasks.document
Module Contents
Classes
API
Dataclass
Bases: Task[Table | DataFrame]
Task for processing batches of text documents. Each batch is stored in tabular form, as either a PyArrow Table or a pandas DataFrame.
data
num_items
Get the number of documents in this batch.
Get column names from the data.
Convert data to a pandas DataFrame.
Convert data to a PyArrow Table.
Validate the task data.