nemo_curator.tasks.document
Bases: Task[Table | DataFrame]
Task for processing a batch of text documents. The documents are stored in tabular form, as either a PyArrow Table or a Pandas DataFrame.
Get the number of documents in this batch.
Get column names from the data.
Convert data to Pandas DataFrame.
Convert data to PyArrow table.
Validate the task data.