# nemo_curator.tasks.document

## Module Contents

### Classes

| Name                                                          | Description                                    |
| ------------------------------------------------------------- | ---------------------------------------------- |
| [`DocumentBatch`](#nemo_curator-tasks-document-DocumentBatch) | Task for processing batches of text documents. |

### API

```python
class nemo_curator.tasks.document.DocumentBatch(
    task_id: str,
    dataset_name: str,
    data: pyarrow.Table | pandas.DataFrame = pa.Table(),
    _stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
    _metadata: dict[str, typing.Any] = dict()
)
```

Dataclass

**Bases:** [Task\[Table | DataFrame\]](/nemo-curator/nemo_curator/tasks/tasks#nemo_curator-tasks-tasks-Task)

Task for processing batches of text documents.
Documents are stored as a dataframe (PyArrow Table or Pandas DataFrame).

Get the number of documents in this batch.

```python
nemo_curator.tasks.document.DocumentBatch.get_columns() -> list[str]
```

Get column names from the data.

```python
nemo_curator.tasks.document.DocumentBatch.to_pandas() -> pandas.DataFrame
```

Convert the data to a Pandas DataFrame.

```python
nemo_curator.tasks.document.DocumentBatch.to_pyarrow() -> pyarrow.Table
```

Convert the data to a PyArrow Table.

```python
nemo_curator.tasks.document.DocumentBatch.validate() -> bool
```

Validate the task data.