# nemo_curator.core.utils

## Module Contents

### Functions

| Name                                                                                        | Description                                                                 |
| ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [`_logger_custom_deserializer`](#nemo_curator-core-utils-_logger_custom_deserializer)       | -                                                                           |
| [`_logger_custom_serializer`](#nemo_curator-core-utils-_logger_custom_serializer)           | -                                                                           |
| [`check_ray_responsive`](#nemo_curator-core-utils-check_ray_responsive)                     | -                                                                           |
| [`get_free_port`](#nemo_curator-core-utils-get_free_port)                                   | Checks if start\_port is free.                                              |
| [`ignore_ray_head_node`](#nemo_curator-core-utils-ignore_ray_head_node)                     | Return True if `CURATOR_IGNORE_RAY_HEAD_NODE` is set to a truthy value.     |
| [`init_cluster`](#nemo_curator-core-utils-init_cluster)                                     | Initialize a new local Ray cluster or connect to an existing one.           |
| [`split_table_by_group_max_bytes`](#nemo_curator-core-utils-split_table_by_group_max_bytes) | Split an Arrow table by approximate byte size without splitting group rows. |

### API

```python
nemo_curator.core.utils._logger_custom_deserializer(
    _: None
) -> loguru.Logger
```

```python
nemo_curator.core.utils._logger_custom_serializer(
    _: loguru.Logger
) -> None
```

```python
nemo_curator.core.utils.check_ray_responsive(
    timeout_s: int = RAY_CLUSTER_START_VERIFICAT...
) -> bool
```

```python
nemo_curator.core.utils.get_free_port(
    start_port: int,
    get_next_free_port: bool = True
) -> int
```

Check whether `start_port` is free and return a usable port.
If `start_port` is not free and `get_next_free_port` is True, return the next free port found when searching from `start_port`.
If `get_next_free_port` is False, raise an error when the free port is not equal to `start_port`.

```python
nemo_curator.core.utils.ignore_ray_head_node() -> bool
```

Return True if `CURATOR_IGNORE_RAY_HEAD_NODE` is set to a truthy value.

Used by both the pipeline executors (to skip the head node when scheduling
stage actors) and the inference-server backends (to emit a worker-only
bundle-label selector on placement groups).
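
As a rough illustration of an environment-variable flag of this kind, the sketch below reads a variable and compares it against a set of truthy spellings. The helper name `env_flag_is_truthy` and the accepted spellings are assumptions; this page does not say which values the library treats as truthy.

```python
import os

# Assumption: these spellings count as truthy; the library may accept a different set.
_TRUTHY_VALUES = {"1", "true", "yes", "on"}


def env_flag_is_truthy(name: str) -> bool:
    """Hypothetical helper: True if the named variable holds a truthy spelling."""
    return os.environ.get(name, "").strip().lower() in _TRUTHY_VALUES
```

With this sketch, `env_flag_is_truthy("CURATOR_IGNORE_RAY_HEAD_NODE")` is False when the variable is unset or empty.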

```python
nemo_curator.core.utils.init_cluster(
    ray_port: int,
    ray_temp_dir: str,
    ray_dashboard_port: int,
    ray_metrics_port: int,
    ray_client_server_port: int,
    ray_dashboard_host: str,
    num_gpus: int | None = None,
    num_cpus: int | None = None,
    object_store_memory: int | None = None,
    enable_object_spilling: bool = False,
    block: bool = True,
    ip_address: str | None = None,
    stdouterr_capture_file: str | None = None
) -> subprocess.Popen
```

Initialize a new local Ray cluster or connect to an existing one.
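
A hedged usage sketch based only on the signature above. The port numbers, temp directory, and dashboard host are illustrative values rather than documented defaults, and `start_local_cluster` is a hypothetical wrapper. The import is deferred so the snippet loads without NeMo Curator installed. This page does not describe how `block` affects the call or when the returned `subprocess.Popen` handle should be terminated, so check the source before relying on this.

```python
import subprocess

# Illustrative values only; none of these are documented defaults.
CLUSTER_KWARGS = {
    "ray_port": 6379,
    "ray_temp_dir": "/tmp/ray",
    "ray_dashboard_port": 8265,
    "ray_metrics_port": 8080,
    "ray_client_server_port": 10001,
    "ray_dashboard_host": "127.0.0.1",
}


def start_local_cluster(**overrides) -> subprocess.Popen:
    """Hypothetical wrapper: call init_cluster with the required arguments."""
    # Deferred import so this module can be loaded without nemo_curator.
    from nemo_curator.core.utils import init_cluster

    # Optional parameters (num_gpus, num_cpus, ...) keep their defaults
    # unless the caller passes them as overrides.
    return init_cluster(**{**CLUSTER_KWARGS, **overrides})
```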

```python
nemo_curator.core.utils.split_table_by_group_max_bytes(
    table: pyarrow.Table,
    group_column: str,
    max_batch_bytes: int | None
) -> list[pyarrow.Table]
```

Split an Arrow table by approximate byte size without splitting group rows.

Each unique value in `group_column` is kept in a single output table.
If a single group exceeds `max_batch_bytes`, it is still emitted as one chunk.

Note: null values in `group_column` are grouped together (consecutive
nulls are not split).  Callers should ensure the column is non-nullable
or handle nulls upstream.
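
The packing rule can be illustrated without Arrow by working on per-row group keys and byte sizes. This is a sketch under stated assumptions, not the library's code: `split_rows_by_group_max_bytes` is a hypothetical name, groups are taken to be runs of consecutive equal keys, batches are filled greedily, and `max_batch_bytes=None` is assumed to mean "no splitting".

```python
from itertools import groupby


def split_rows_by_group_max_bytes(group_keys, row_bytes, max_batch_bytes):
    """Return batches of row indices; a run of equal keys is never split."""
    if max_batch_bytes is None:
        # Assumption: no limit means everything stays in one batch.
        return [list(range(len(group_keys)))] if group_keys else []

    batches, current, current_bytes = [], [], 0
    # groupby yields runs of consecutive equal keys; None keys form runs too.
    for _, run in groupby(range(len(group_keys)), key=lambda i: group_keys[i]):
        rows = list(run)
        run_bytes = sum(row_bytes[i] for i in rows)
        # Close the current batch if adding this group would exceed the limit.
        if current and current_bytes + run_bytes > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        # An oversized group still goes out whole, as its own batch.
        current.extend(rows)
        current_bytes += run_bytes
    if current:
        batches.append(current)
    return batches


keys = ["a", "a", "b", None, None, "c"]
sizes = [40, 40, 30, 10, 10, 200]
print(split_rows_by_group_max_bytes(keys, sizes, 100))
# [[0, 1], [2, 3, 4], [5]]
```

In the example, the two null rows stay together, and group `"c"` (200 bytes) exceeds the 100-byte limit but is still emitted as a single batch.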