tasks.document

Module Contents

Classes

DocumentBatch

Task for processing batches of text documents. Documents are stored as a dataframe (PyArrow Table or Pandas DataFrame).

API

class tasks.document.DocumentBatch

Bases: tasks.tasks.Task[pyarrow.Table | pandas.DataFrame]

Task for processing batches of text documents. Documents are stored as a dataframe (PyArrow Table or Pandas DataFrame).

data: pyarrow.Table | pandas.DataFrame

Value: 'field(…)'

get_columns() -> list[str]

Get column names from the data.

property num_items: int

Get the number of documents in this batch.

to_pandas() -> pandas.DataFrame

Convert data to a Pandas DataFrame.

to_pyarrow() -> pyarrow.Table

Convert data to a PyArrow Table.

validate() -> bool

Validate the task data.